$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=none -p GPU /usr/bin/env | grep CUDA
CUDA_VISIBLE_DEVICES=0,1

The result should be CUDA_VISIBLE_DEVICES=NoDevFiles, and it really is 
NoDevFiles under 17.02, so this must be a bug in 17.11.7.


From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Brian W. Johanson
Sent: Thursday, August 30, 2018 11:23 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] [External] Re: serious bug about CUDA_VISBLE_DEVICES in the slurm 17.11.7


And to answer "CUDA_VISIBLE_DEVICES can't be set to NoDevFiles in Slurm 17.11.7":

With --gres=none, Slurm leaves CUDA_VISIBLE_DEVICES unset; if it is already set 
in the user's environment, it remains set to whatever value it had. If you 
really want to see NoDevFiles, set it in /etc/profile.d; it will get clobbered 
when GPU resources are actually allocated.
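
A minimal /etc/profile.d sketch of that default (the filename is illustrative, 
not from this thread):

# /etc/profile.d/cuda_default.sh (hypothetical filename)
# Site-wide default: no GPUs visible unless Slurm says otherwise.
# Slurm clobbers this value whenever a job is allocated GPU gres.
export CUDA_VISIBLE_DEVICES=NoDevFiles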



$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=none -p GPU /usr/bin/env | grep CUDA
CUDA_VISIBLE_DEVICES=0,1
$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=none -p GPU nvidia-smi
No devices were found


$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:1 -p GPU /usr/bin/env | grep CUDA
CUDA_VISIBLE_DEVICES=0
$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:1 -p GPU nvidia-smi | grep Tesla | wc
      1      11      80
$


$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:2 -p GPU /usr/bin/env | grep CUDA
CUDA_VISIBLE_DEVICES=0,1
$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:2 -p GPU nvidia-smi | grep Tesla | wc
      2      22     160
$



On 08/30/2018 10:48 AM, Renfro, Michael wrote:

Chris’ method will set CUDA_VISIBLE_DEVICES like you’re used to, and it will 
help keep you or your users from picking conflicting devices.



My cgroup/GPU settings from slurm.conf:



=====



[renfro@login ~]$ egrep -i '(cgroup|gpu)' /etc/slurm/slurm.conf | grep -v '^#'
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
NodeName=gpunode[001-004] CoresPerSocket=14 RealMemory=126000 Sockets=2 ThreadsPerCore=1 Gres=gpu:2
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00 MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=gpu-debug Default=NO MinNodes=1 MaxTime=00:30:00 AllowGroups=ALL PriorityJobFactor=2 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=gpu-interactive Default=NO MinNodes=1 MaxNodes=2 MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
GresTypes=gpu,mic



=====
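
For the cgroup setup to actually fence off GPU device files, cgroup.conf needs 
ConstrainDevices as well; a minimal sketch (illustrative values, not quoted 
from this cluster's actual file):

=====

# /etc/slurm/cgroup.conf (sketch)
CgroupAutomount=yes
ConstrainDevices=yes      # jobs can only open the GPUs they were allocated
ConstrainCores=yes
ConstrainRAMSpace=yes

=====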



Example (where hpcshell is a function that runs "srun --pty $SHELL -I"; a 
sketch of the function follows the session below), with no CUDA_VISIBLE_DEVICES 
on the submit host, but correctly set once GPUs are reserved:



=====



[renfro@login ~]$ echo $CUDA_VISIBLE_DEVICES

[renfro@login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:1
[renfro@gpunode003 ~]$ echo $CUDA_VISIBLE_DEVICES
0
[renfro@login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:2
[renfro@gpunode004 ~]$ echo $CUDA_VISIBLE_DEVICES
0,1



=====
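
For reference, hpcshell could be a bash function along these lines; the 
message only describes it, so this definition is a sketch:

=====

# Hypothetical definition: pass any srun options through,
# then hand the allocation an interactive login shell.
hpcshell () {
    srun "$@" --pty $SHELL -I
}

=====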



On Aug 30, 2018, at 4:18 AM, Chaofeng Zhang <zhang...@lenovo.com> wrote:



CUDA_VISIBLE_DEVICES is used by many AI frameworks, such as TensorFlow, to 
determine which GPU to use, so this environment variable is critical to us.



-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Chris Samuel
Sent: Thursday, August 30, 2018 4:42 PM
To: slurm-users@lists.schedmd.com
Subject: [External] Re: [slurm-users] serious bug about CUDA_VISBLE_DEVICES in the slurm 17.11.7



On Thursday, 30 August 2018 6:38:08 PM AEST Chaofeng Zhang wrote:



CUDA_VISIBLE_DEVICES can't be set to NoDevFiles in Slurm 17.11.7.

This worked when we used Slurm 17.02.



You should probably be using cgroups instead to constrain access to GPUs.

Then it doesn't matter what CUDA_VISIBLE_DEVICES is set to, as processes will 
only be able to access the devices they requested.
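
For example, with device constraint in place one would expect a single-GPU job 
to see exactly one device, whatever the submit host exported (a sketch reusing 
the partition and gres names from earlier in the thread):

$ export CUDA_VISIBLE_DEVICES=0,1
$ srun -N 1 -n 1 --gres=gpu:1 -p GPU nvidia-smi -L | wc -l
1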



Hope that helps!

Chris

--

Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC