And to answer "CUDA_VISIBLE_DEVICES can't be set to NoDevFiles in Slurm 17.11.7":

With --gres=none, Slurm leaves CUDA_VISIBLE_DEVICES alone: if it is unset it stays unset, and if it is set in the user's environment it remains at whatever value it had.  If you really want to see NoDevFiles, set that in /etc/profile.d; it will get clobbered whenever GPU resources are actually allocated.
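A minimal sketch of that profile.d approach (the file name is an assumption):

# /etc/profile.d/cuda-default.sh
# Give every shell a harmless default; Slurm replaces the value for job
# steps that are actually allocated GPUs via --gres=gpu:N.
if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
    export CUDA_VISIBLE_DEVICES=NoDevFiles
fi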


$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=none -p GPU /usr/bin/env |grep CUDA
*CUDA_VISIBLE_DEVICES=0,1*
$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=none -p GPU nvidia-smi
*No devices were found*


$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:1 -p GPU /usr/bin/env |grep CUDA
*CUDA_VISIBLE_DEVICES=0*
$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:1 -p GPU nvidia-smi |grep Tesla | wc
*     1      11      80*
$


$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:2 -p GPU /usr/bin/env |grep CUDA
*CUDA_VISIBLE_DEVICES=0,1*
$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:2 -p GPU nvidia-smi |grep Tesla | wc
*      2      22     160*
$



On 08/30/2018 10:48 AM, Renfro, Michael wrote:
Chris’ method will set CUDA_VISIBLE_DEVICES like you’re used to, and it will 
help keep you or your users from picking conflicting devices.

My cgroup/GPU settings from slurm.conf:

=====

[renfro@login ~]$ egrep -i '(cgroup|gpu)' /etc/slurm/slurm.conf | grep -v '^#'
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
NodeName=gpunode[001-004]  CoresPerSocket=14 RealMemory=126000 Sockets=2 ThreadsPerCore=1 Gres=gpu:2
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00 MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=gpu-debug Default=NO MinNodes=1 MaxTime=00:30:00 AllowGroups=ALL PriorityJobFactor=2 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=gpu-interactive Default=NO MinNodes=1 MaxNodes=2 MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
GresTypes=gpu,mic

=====

Example (where hpcshell is a shell function that runs "srun --pty $SHELL -I"; a sketch of it follows the transcript), with no CUDA_VISIBLE_DEVICES set on the submit host, but correctly set once GPUs are reserved:

=====

[renfro@login ~]$ echo $CUDA_VISIBLE_DEVICES

[renfro@login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:1
[renfro@gpunode003 ~]$ echo $CUDA_VISIBLE_DEVICES
0
[renfro@login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:2
[renfro@gpunode004 ~]$ echo $CUDA_VISIBLE_DEVICES
0,1

=====
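A minimal sketch of such a wrapper, assuming it lives in the users' shell startup files (the srun arguments are taken verbatim from the description above):

hpcshell () {
    # Forward options such as --partition and --gres to srun, then hand the
    # user an interactive shell on the allocated node.
    srun "$@" --pty $SHELL -I
}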

On Aug 30, 2018, at 4:18 AM, Chaofeng Zhang <zhang...@lenovo.com> wrote:

CUDA_VISIBLE_DEVICES is used by many AI frameworks, such as TensorFlow, to determine which GPU to use, so this environment variable is critical to us.
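As a minimal sketch of that usage (train_model.py is a hypothetical TensorFlow script), a batch job only has to request its GPUs and the framework sees whatever Slurm exports:

#!/bin/bash
#SBATCH --gres=gpu:1
# Slurm exports CUDA_VISIBLE_DEVICES for the allocated device, so the
# framework (TensorFlow here) enumerates only that GPU.
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
python train_model.py   # hypothetical TensorFlow training script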

-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Chris 
Samuel
Sent: Thursday, August 30, 2018 4:42 PM
To: slurm-users@lists.schedmd.com
Subject: [External] Re: [slurm-users] serious bug about CUDA_VISBLE_DEVICES in 
the slurm 17.11.7

On Thursday, 30 August 2018 6:38:08 PM AEST Chaofeng Zhang wrote:

CUDA_VISIBLE_DEVICES can't be set to NoDevFiles in Slurm 17.11.7.
This worked when we used Slurm 17.02.

You probably should be using cgroups instead to constrain access to GPUs.
Then it doesn't matter what CUDA_VISIBLE_DEVICES is set to, as processes will
only be able to access what they requested.
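A minimal sketch of the cgroup side, used together with TaskPlugin=task/cgroup in slurm.conf as in Michael's configuration above (parameter values and device paths are assumptions; adjust for your site):

# /etc/slurm/cgroup.conf
CgroupAutomount=yes
ConstrainDevices=yes      # a job can only open the /dev/nvidia* files it was allocated
ConstrainCores=yes
ConstrainRAMSpace=yes

# /etc/slurm/gres.conf on each GPU node
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1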

Hope that helps!
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC





