Shawn,

Just to give you a compare and contrast:
We have the following related entries.

slurm.conf:

JobAcctGatherType=jobacct_gather/linux   # will migrate to cgroup eventually
JobAcctGatherFrequency=30
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup

cgroup_allowed_devices_file.conf:

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/nvidia*

gres.conf (4 K80s on a node with a 24-core Haswell):

Name=gpu File=/dev/nvidia0 CPUs=0-5
Name=gpu File=/dev/nvidia1 CPUs=12-17
Name=gpu File=/dev/nvidia2 CPUs=6-11
Name=gpu File=/dev/nvidia3 CPUs=18-23

I also checked multi-tenant jobs on our MARCC cluster that have been running for more than a day, and they are still inside their cgroups, but again these are CentOS 6 clusters.

Are you still seeing cgroup escapes now, specifically for jobs running longer than a day?

Thanks,
Kevin

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Shawn Bobbin <sabob...@umiacs.umd.edu>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Monday, April 23, 2018 at 2:45 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

Hi,

I attached our cgroup.conf and gres.conf.

As for the cgroup_allowed_devices.conf file, I have this file stubbed but empty. In 17.02, Slurm started fine without this file (as far as I remember), and its being empty doesn't appear to actually impact anything… device availability remains the same. Based on the behavior explained in [0], I don't expect this file to affect containment of specific GPUs.

TaskPlugin = task/cgroup
ProctrackType = proctrack/cgroup
JobAcctGatherType = jobacct_gather/cgroup

[0] https://bugs.schedmd.com/show_bug.cgi?id=4122
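
For anyone comparing setups: the device containment being discussed is driven by cgroup.conf rather than the files quoted above. A minimal sketch of such a file is below; it is illustrative only (not the attachment Shawn referenced), and the AllowedDevicesFile path is an assumed default install location.

# Illustrative cgroup.conf sketch, not the attached file
CgroupAutomount=yes        # mount cgroup subsystems if not already mounted
ConstrainCores=yes         # confine tasks to their allocated cores
ConstrainRAMSpace=yes      # confine tasks to their allocated memory
ConstrainDevices=yes       # required for GPU/device containment via the devices controller
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf   # path is an assumption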