Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-05-01 Thread Nate Coraor
Thanks Andy, I've been able to confirm that in my case, any jobs that ran for at least 30 minutes (puppet's run interval) would lose their cgroups, and that the time those cgroups disappear corresponds exactly with puppet runs. I am not sure if this cgroup change to root is what causes the oom

Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-30 Thread Andy Georges
> On 30 Apr 2018, at 22:37, Nate Coraor wrote:
>
> Hi Shawn,
>
> I'm wondering if you're still seeing this. I've recently enabled task/cgroup on 17.11.5 running on CentOS 7 and just discovered that jobs are escaping their cgroups. For me this is resulting in a lot of jobs ending in OU

Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-30 Thread Nate Coraor
PUs=6-11 >> >> Name=gpu File=/dev/nvidia3 CPUs=18-23 >> >> >> >> >> >> I also looked for multi-tenant jobs on our MARCC cluster with jobs > 1 >> day and they are still inside of cgroups, but again this is on CentOS6 >> cluster

Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-30 Thread Nate Coraor
nant jobs on our MARCC cluster with jobs > 1 day > and they are still inside of cgroups, but again this is on CentOS6 clusters. > > > > Are you still seeing cgroup escapes now, specifically for jobs > 1 day? > > > > Thanks, > > Kevin > > > > > > >

Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-23 Thread Kevin Manalo
m User Community List Date: Monday, April 23, 2018 at 2:45 PM To: Slurm User Community List Subject: Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time. Hi, I attached our cgroup.conf and gres.conf. As for the cgroup_allowed_devices.conf file, I have this file stubbed but

Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-23 Thread Shawn Bobbin
Community List <slurm-users@lists.schedmd.com> Date: Thursday, April 12, 2018 at 9:25 AM To: "slurm-us...@schedmd.com" <slurm-us...@schedmd.com> Subject: [slurm-users] Jobs escaping cgroup device controls after some amount of time. Hi, We’re running slurm 17.11.5 on RHEL 7 and

Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-13 Thread Kevin Manalo
gpu users inside of cgroup, but not multi-tenant currently. 17.11.5, CentOS6 From: slurm-users on behalf of Shawn Bobbin Reply-To: Slurm User Community List Date: Thursday, April 12, 2018 at 9:25 AM To: "slurm-us...@schedmd.com" Subject: [slurm-users] Jobs escaping cgroup device contro

[slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-12 Thread Shawn Bobbin
Hi,

We’re running slurm 17.11.5 on RHEL 7 and have been having issues with jobs escaping their cgroup controls on GPU devices. For example, we have the following steps running:

# ps auxn | grep [s]lurmstepd
0 2380 0.0 0.0 538436 3700 ? Sl 07:22 0:02 slurmstepd: [46609.0]
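
For anyone wanting to check whether a running step is still confined, the following is a minimal Python sketch, not something posted in this thread. It parses the cgroup v1 format of /proc/<pid>/cgroup (one "hierarchy-ID:controller-list:path" entry per line) and reports whether the devices controller path still sits under a Slurm hierarchy. The "/slurm/" prefix, the helper names, and the sample uid are assumptions for illustration; adjust them to match your own cgroup.conf and mount layout.

```python
from typing import Dict


def parse_proc_cgroup(text: str) -> Dict[str, str]:
    """Map each cgroup v1 controller name to its cgroup path.

    `text` is the content of /proc/<pid>/cgroup, one entry per line in the
    form "hierarchy-ID:controller-list:cgroup-path".
    """
    paths: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _hierarchy_id, controllers, path = parts
        for controller in controllers.split(","):
            if controller:
                paths[controller] = path
    return paths


def has_escaped(text: str, controller: str = "devices",
                expected_prefix: str = "/slurm/") -> bool:
    """Return True if the controller path is missing or outside the prefix.

    The default prefix "/slurm/" is an assumption about how the Slurm
    hierarchy is named on the node, not something taken from this thread.
    """
    path = parse_proc_cgroup(text).get(controller)
    return path is None or not path.startswith(expected_prefix)


def read_proc_cgroup(pid: int) -> str:
    """Read /proc/<pid>/cgroup for a live process (Linux only)."""
    with open(f"/proc/{pid}/cgroup") as handle:
        return handle.read()


# Hypothetical samples: a confined step and one that has fallen back to root.
confined = "11:devices:/slurm/uid_1000/job_46609/step_0\n4:cpu,cpuacct:/\n"
escaped = "11:devices:/\n4:cpu,cpuacct:/\n"
print(has_escaped(confined), has_escaped(escaped))
```

On a compute node, has_escaped(read_proc_cgroup(pid)) can be run against the PIDs of a job's processes, for example before and after a configuration-management run, to see when the devices path changes.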