I also remember write-only permissions being involved when working with cgroups and devices... which bent my head slightly.
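
For anyone who hits the same thing: in the cgroup v1 devices controller, devices.allow and devices.deny are write-only control files, and only devices.list can be read back to see what a job is actually permitted. Below is a minimal sketch of the Slurm side, using the stock cgroup.conf option names; the uid/job numbers in the paths are made up purely for illustration, so adjust for your own site.

=====

# /etc/slurm/cgroup.conf -- minimal sketch, not a production file
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf

# devices.allow / devices.deny are write-only; devices.list is what you read
# (uid_1000 / job_1234 are illustrative, substitute a real job's cgroup path)
[root@gpunode001 ~]# ls -l /sys/fs/cgroup/devices/slurm/uid_1000/job_1234/devices.*
[root@gpunode001 ~]# cat /sys/fs/cgroup/devices/slurm/uid_1000/job_1234/devices.list

=====

The granted entries show up in devices.list (character devices with major 195 for the NVIDIA cards), which is the quickest way to confirm the constraint is really in place.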
On Thu, 30 Aug 2018 at 17:02, John Hearns <hear...@googlemail.com> wrote:

> Chaofeng, I agree with what Chris says. You should be using cgroups.
>
> I did a lot of work with cgroups and GPUs in PBSPro (yes I know...
> splitter!)
> With cgroups you only get access to the devices which are allocated to
> that cgroup, and you get CUDA_VISIBLE_DEVICES set for you.
>
> Remember also to look at the permissions on /dev/nvidia(0,1,2...) -
> which are usually OK - and on /dev/nvidiactl.
>
> On Thu, 30 Aug 2018 at 15:52, Renfro, Michael <ren...@tntech.edu> wrote:
>
>> Chris' method will set CUDA_VISIBLE_DEVICES like you're used to, and it
>> will help keep you or your users from picking conflicting devices.
>>
>> My cgroup/GPU settings from slurm.conf:
>>
>> =====
>>
>> [renfro@login ~]$ egrep -i '(cgroup|gpu)' /etc/slurm/slurm.conf | grep -v '^#'
>> ProctrackType=proctrack/cgroup
>> TaskPlugin=task/affinity,task/cgroup
>> NodeName=gpunode[001-004] CoresPerSocket=14 RealMemory=126000 Sockets=2 ThreadsPerCore=1 Gres=gpu:2
>> PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00 MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
>> PartitionName=gpu-debug Default=NO MinNodes=1 MaxTime=00:30:00 AllowGroups=ALL PriorityJobFactor=2 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
>> PartitionName=gpu-interactive Default=NO MinNodes=1 MaxNodes=2 MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
>> GresTypes=gpu,mic
>>
>> =====
>>
>> Example (where hpcshell is a function that runs "srun --pty $SHELL -I"),
>> with no CUDA_VISIBLE_DEVICES on the submit host, but it is correctly set
>> on reserving GPUs:
>>
>> =====
>>
>> [renfro@login ~]$ echo $CUDA_VISIBLE_DEVICES
>>
>> [renfro@login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:1
>> [renfro@gpunode003 ~]$ echo $CUDA_VISIBLE_DEVICES
>> 0
>> [renfro@login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:2
>> [renfro@gpunode004 ~]$ echo $CUDA_VISIBLE_DEVICES
>> 0,1
>>
>> =====
>>
>> > On Aug 30, 2018, at 4:18 AM, Chaofeng Zhang <zhang...@lenovo.com> wrote:
>> >
>> > CUDA_VISIBLE_DEVICES is used by many AI frameworks to determine which
>> > gpu to use, like tensorflow. So this environment variable is critical to us.
>> >
>> > -----Original Message-----
>> > From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Chris Samuel
>> > Sent: Thursday, August 30, 2018 4:42 PM
>> > To: slurm-users@lists.schedmd.com
>> > Subject: [External] Re: [slurm-users] serious bug about CUDA_VISBLE_DEVICES in the slurm 17.11.7
>> >
>> > On Thursday, 30 August 2018 6:38:08 PM AEST Chaofeng Zhang wrote:
>> >
>> >> CUDA_VISIBLE_DEVICES can't be set to NoDevFiles in Slurm 17.11.7.
>> >> This worked when we used Slurm 17.02.
>> >
>> > You probably should be using cgroups instead to constrain access to GPUs.
>> > Then it doesn't matter what you set CUDA_VISIBLE_DEVICES to be, as
>> > processes will only be able to access what they requested.
>> >
>> > Hope that helps!
>> > Chris
>> > --
>> > Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
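
One more piece for completeness: slurm.conf above only declares Gres=gpu:2, so each gpunode also needs a gres.conf that maps the GRES to the device files - that mapping is what the device cgroup (and the CUDA_VISIBLE_DEVICES Slurm exports) is built from. A sketch, assuming two GPUs per node and the usual /dev/nvidiaN names; adjust the File= paths to match your hardware.

=====

# /etc/slurm/gres.conf on each gpunode -- sketch, assuming two GPUs per node
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1

=====

With ConstrainDevices=yes in cgroup.conf, a job that asked for --gres=gpu:1 can only open the device it was allocated, so even an empty or wrong CUDA_VISIBLE_DEVICES cannot reach the other card.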