Hey folks,

Here is my setup:
slurm-20.11.4 on x86_64 running CentOS 7.x with CUDA 11.1

The relevant parts of the slurm.conf and a particular gres.conf file are:

SelectType=select/cons_res
SelectTypeParameters=CR_Core
PriorityType=priority/multifactor
GresTypes=gpu

NodeName=dlt[01-12] Gres=gpu:8 Feature=rtx Procs=40 State=UNKNOWN
PartitionName=dlt Nodes=dlt[01-12] Default=NO Shared=Exclusive MaxTime=4-00:00:00 State=UP DefaultTime=8:00:00

And the gres.conf file for those nodes:

[root@dlt02 ~]# more /etc/slurm/gres.conf
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
Name=gpu File=/dev/nvidia4
Name=gpu File=/dev/nvidia5
Name=gpu File=/dev/nvidia6
Name=gpu File=/dev/nvidia7

Now for the weird part. srun works as expected and gives me a single GPU:

[tim@rc-admin01 ~]$ srun -p dlt -N 1 -w dlt02 --gres=gpu:1 -A ops --pty -u /bin/bash
[tim@dlt02 ~]$ env | grep CUDA
CUDA_VISIBLE_DEVICES=0

If I submit basically the same thing with sbatch:

[tim@rc-admin01 ~]$ cat sbatch.test
#!/bin/bash
#SBATCH -N 1
#SBATCH -A ops
#SBATCH -t 10
#SBATCH -p dlt
#SBATCH --gres=gpu:1
#SBATCH -w dlt02

env | grep CUDA

I get the following output:

[tim@rc-admin01 ~]$ cat slurm-28824.out
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

Any ideas of what is going on here? Thanks in advance! This one has me stumped.
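In case it helps to narrow this down, here is a sketch of an expanded version of the same test script I can run; the extra srun, scontrol, and nvidia-smi lines are purely diagnostic, to compare what the job was actually allocated against what the batch shell and a job step each see:

#!/bin/bash
#SBATCH -N 1
#SBATCH -A ops
#SBATCH -t 10
#SBATCH -p dlt
#SBATCH --gres=gpu:1
#SBATCH -w dlt02

# What the batch shell itself sees
env | grep CUDA
# What a job step launched with srun sees
srun env | grep CUDA
# What the scheduler actually allocated to this job
scontrol -d show job $SLURM_JOB_ID | grep -i gres
# Which devices are visible at the driver level on the node
nvidia-smi -L

If the srun step reports a single device while the batch shell reports all eight, that would point at the batch step environment rather than the allocation itself.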