In my case, the GPU resource is defined in the job file with #SBATCH --gres=gpu:2, so when I use srun, CUDA_VISIBLE_DEVICES=0,1 is already set in the shell. I just want to set CUDA_VISIBLE_DEVICES=NoDevFiles for one specific srun step. That does not work in 17.11.7, but it worked in 17.02. The job script below reproduces it.
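A minimal sketch of an explicit per-step override (the env(1) wrapper is an assumption for illustration; neither message reports testing it on 17.11.7):

=====
# Hypothetical workaround: rather than relying on --gres=none to clear the
# variable, override it inside the step itself, after Slurm has exported
# its own value, by wrapping the command with env(1).
srun -N1 -n1 --nodelist=c1 \
     env CUDA_VISIBLE_DEVICES=NoDevFiles printenv CUDA_VISIBLE_DEVICES
=====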
#!/bin/bash
#SBATCH --job-name=test_obj
#SBATCH --workdir=/home/hpcadmin/aaa
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --mincpus=5
#SBATCH --gres=gpu:2

echo ==========no gres==================
srun -N1 -n1 --gres=none --nodelist=c1 env | grep CUDA

Output:
CUDA_VISIBLE_DEVICES=0,1

-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Renfro, Michael
Sent: Thursday, August 30, 2018 10:49 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [External] Re: serious bug about CUDA_VISBLE_DEVICES in the slurm 17.11.7

Chris' method will set CUDA_VISIBLE_DEVICES like you're used to, and it will help keep you or your users from picking conflicting devices. My cgroup/GPU settings from slurm.conf:

=====
[renfro@login ~]$ egrep -i '(cgroup|gpu)' /etc/slurm/slurm.conf | grep -v '^#'
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
NodeName=gpunode[001-004] CoresPerSocket=14 RealMemory=126000 Sockets=2 ThreadsPerCore=1 Gres=gpu:2
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00 MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=gpu-debug Default=NO MinNodes=1 MaxTime=00:30:00 AllowGroups=ALL PriorityJobFactor=2 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=gpu-interactive Default=NO MinNodes=1 MaxNodes=2 MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
GresTypes=gpu,mic
=====

Example (where hpcshell is a function that runs "srun --pty $SHELL -I"): CUDA_VISIBLE_DEVICES is unset on the submit host, but is set correctly once GPUs are reserved:

=====
[renfro@login ~]$ echo $CUDA_VISIBLE_DEVICES

[renfro@login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:1
[renfro@gpunode003 ~]$ echo $CUDA_VISIBLE_DEVICES
0
[renfro@login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:2
[renfro@gpunode004 ~]$ echo $CUDA_VISIBLE_DEVICES
0,1
=====

> On Aug 30, 2018, at 4:18 AM, Chaofeng Zhang <zhang...@lenovo.com> wrote:
>
> CUDA_VISIBLE_DEVICES is used by many AI frameworks, such as TensorFlow, to determine which GPU to use, so this environment variable is critical to us.
>
> -----Original Message-----
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Chris Samuel
> Sent: Thursday, August 30, 2018 4:42 PM
> To: slurm-users@lists.schedmd.com
> Subject: [External] Re: [slurm-users] serious bug about CUDA_VISBLE_DEVICES in the slurm 17.11.7
>
> On Thursday, 30 August 2018 6:38:08 PM AEST Chaofeng Zhang wrote:
>
>> CUDA_VISIBLE_DEVICES can't be set to NoDevFiles in Slurm 17.11.7.
>> This worked when we used Slurm 17.02.
>
> You should probably be using cgroups instead to constrain access to the GPUs.
> Then it doesn't matter what CUDA_VISIBLE_DEVICES is set to, as processes
> will only be able to access what they requested.
>
> Hope that helps!
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
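For reference, the cgroup enforcement Chris describes also depends on device constraint being enabled in cgroup.conf. A sketch using standard option names (the values are illustrative, not taken from either cluster above):

=====
# /etc/slurm/cgroup.conf -- illustrative sketch, not either poster's file
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
# ConstrainDevices is the option that actually stops a job step from
# opening /dev/nvidia* device files for GPUs it was not allocated.
ConstrainDevices=yes
=====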