Thanks Andy,
I've been able to confirm that in my case, any jobs that ran for at least
30 minutes (puppet's run interval) would lose their cgroups, and that the
time those cgroups disappear corresponds exactly with puppet runs. I am not
sure yet whether this change of cgroup to the root is what causes the OOMs.
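For anyone wanting to check the same thing, a rough way to confirm it (the job ID, PID and puppet unit name below are placeholders) is to record a step's cgroup membership and compare the timing against the agent's runs:

  # find the slurmstepd PID for a job of interest
  pgrep -af 'slurmstepd: \[12345'
  # note which cgroups that PID sits in; repeat after the next puppet run
  grep devices /proc/<pid>/cgroup
  # if the agent runs as a systemd service, its recent runs are visible with
  journalctl -u puppet --since "1 hour ago"

A step that is still confined shows a path like /slurm/uid_<uid>/job_<jobid>/step_<stepid>; one that has escaped is back at the root of the hierarchy.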
> On 30 Apr 2018, at 22:37, Nate Coraor wrote:
>
> Hi Shawn,
>
> I'm wondering if you're still seeing this. I've recently enabled task/cgroup
> on 17.11.5 running on CentOS 7 and just discovered that jobs are escaping
> their cgroups. For me this is resulting in a lot of jobs ending in
> OUT_OF_MEMORY.
>
>> Name=gpu File=... CPUs=6-11
>>
>> Name=gpu File=/dev/nvidia3 CPUs=18-23
>>
> I also looked for multi-tenant jobs on our MARCC cluster with jobs > 1 day
> and they are still inside of cgroups, but again this is on CentOS6 clusters.
>
> Are you still seeing cgroup escapes now, specifically for jobs > 1 day?
>
> Thanks,
>
> Kevin
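The kind of check described above can be scripted roughly as follows; the job ID is a placeholder, and scontrol listpids has to be run on the node where the job is executing:

  # running jobs older than one day (the TIME field contains a '-')
  squeue -t R -o "%i %u %M" | awk 'NR==1 || $3 ~ /-/'
  # on the compute node, list the PIDs belonging to a given job ...
  scontrol listpids <jobid>
  # ... and verify each one still sits in its slurm/uid_*/job_*/step_* cgroup
  grep devices /proc/<pid>/cgroup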
Reply-To: Slurm User Community List
Date: Monday, April 23, 2018 at 2:45 PM
To: Slurm User Community List
Subject: Re: [slurm-users] Jobs escaping cgroup device controls after some
amount of time.
Hi,
I attached our cgroup.conf and gres.conf.
As for the cgroup_allowed_devices.conf file, I have this file stubbed but
> gpu users inside of cgroup, but not multi-tenant currently. 17.11.5, CentOS6
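For reference, the kind of cgroup.conf being discussed here, with illustrative values rather than the actual attachment, looks something like:

  CgroupAutomount=yes
  ConstrainCores=yes
  ConstrainDevices=yes
  ConstrainRAMSpace=yes
  AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf

and the allowed-devices file it points at simply whitelists the devices every job may see, e.g. /dev/null, /dev/zero, /dev/urandom, /dev/pts/* and /dev/cpu/*/*.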
From: slurm-users on behalf of Shawn
Bobbin
Reply-To: Slurm User Community List
Date: Thursday, April 12, 2018 at 9:25 AM
To: "slurm-us...@schedmd.com"
Subject: [slurm-users] Jobs escaping cgroup device controls after some amount of time.
Hi,
We’re running slurm 17.11.5 on RHEL 7 and have been having issues with jobs
escaping their cgroup controls on GPU devices.
For example we have the following steps running:
# ps auxn | grep [s]lurmstepd
0  2380  0.0  0.0 538436  3700 ?  Sl  07:22  0:02 slurmstepd: [46609.0]
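To see whether a step like this one is still confined, the process's cgroup membership can be compared with the device whitelist Slurm created for it (the paths below assume the default cgroup mount point; <uid> is the job owner's uid):

  # cgroups the slurmstepd PID from the listing above belongs to
  grep devices /proc/2380/cgroup
  # the devices the cgroup allows for this step
  cat /sys/fs/cgroup/devices/slurm/uid_<uid>/job_46609/step_0/devices.list

If the devices line in /proc/2380/cgroup no longer points at the job_46609/step_0 path, the step has escaped the device restrictions regardless of what devices.list still says.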