Greetings, I am setting up our new GPU cluster, and I'm having trouble configuring things so that the devices are properly walled off via cgroups. Our nodes each have two GPUs; however, if --gres is unset, or set to --gres=gpu:0, I can access both GPUs from inside a job. Moreover, if I ask for just one GPU and then unset the CUDA_VISIBLE_DEVICES environment variable, I can again access both GPUs. From my understanding, this suggests that the devices are *not* being constrained under cgroups.
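Concretely, the second case looks roughly like this (a sketch of the test, not exact output):

  # request a single GPU
  $ srun --gres=gpu:1 --pty bash
  # Slurm sets CUDA_VISIBLE_DEVICES to the allocated device...
  $ echo $CUDA_VISIBLE_DEVICES
  # ...but after unsetting it, nvidia-smi reports both GPUs
  $ unset CUDA_VISIBLE_DEVICES
  $ nvidia-smi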
I've read the documentation, and I've read through a number of threads where people have resolved similar issues. I've tried a lot of configurations, but to no avail. Below I include snippets of the relevant (current) parameters; I am also attaching most of our full conf files.

[slurm.conf]
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
JobAcctGatherType=jobacct_gather/linux
AccountingStorageTRES=gres/gpu
GresTypes=gpu
NodeName=evc1 CPUs=32 RealMemory=191917 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN NodeAddr=ivc1 Weight=1 Gres=gpu:2

[gres.conf]
NodeName=evc[1-10] Name=gpu File=/dev/nvidia0 COREs=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NodeName=evc[1-10] Name=gpu File=/dev/nvidia1 COREs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31

[cgroup.conf]
ConstrainDevices=yes

[cgroup_allowed_devices_file.conf]
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
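For what it's worth, I believe something along these lines should show whether the device whitelist is actually being applied to a step (assuming cgroup v1 and the default /sys/fs/cgroup mountpoint; adjust the path to whatever /proc/self/cgroup reports on your setup):

  # from inside an interactive job step:
  $ srun --gres=gpu:1 --pty bash
  # find which devices cgroup the step landed in
  $ grep devices /proc/self/cgroup
  # and list what that cgroup actually allows
  $ cat /sys/fs/cgroup/devices$(grep devices /proc/self/cgroup | cut -d: -f3)/devices.list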
Thanks, Paul.

Attachments: cgroup_allowed_devices_file.conf, cgroup.conf, gres.conf, slurm.conf