Hey folks,

Here is my setup:
slurm-20.11.4 on x86_64 running CentOS 7.x with CUDA 11.1

The relevant parts of the slurm.conf and a particular gres.conf file are:

SelectType=select/cons_res
SelectTypeParameters=CR_Core
PriorityType=priority/multifactor
GresTypes=gpu

NodeName=dlt[01-12] Gres=gpu:8 Feature=rtx Procs=40 State=UNKNOWN
PartitionName=dlt Nodes=dlt[01-12] Default=NO Shared=Exclusive MaxTime=4-00:00:00 State=UP DefaultTime=8:00:00

And the gres.conf file for those nodes:

[root@dlt02 ~]# more /etc/slurm/gres.conf
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
Name=gpu File=/dev/nvidia4
Name=gpu File=/dev/nvidia5
Name=gpu File=/dev/nvidia6
Name=gpu File=/dev/nvidia7

Now for the weird part. srun works as expected and gives me a single GPU:

[tim@rc-admin01 ~]$ srun -p dlt -N 1 -w dlt02 --gres=gpu:1 -A ops --pty -u /bin/bash
[tim@dlt02 ~]$ env | grep CUDA
CUDA_VISIBLE_DEVICES=0

If I submit basically the same thing with sbatch:

[tim@rc-admin01 ~]$ cat sbatch.test
#!/bin/bash
#SBATCH -N 1
#SBATCH -A ops
#SBATCH -t 10
#SBATCH -p dlt
#SBATCH --gres=gpu:1
#SBATCH -w dlt02

env | grep CUDA

I get the following output:

[tim@rc-admin01 ~]$ cat slurm-28824.out
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

Any ideas of what is going on here? Thanks in advance! This one has me stumped.
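In case it helps to narrow this down, here is a sketch of an expanded version of the same test script I can run; the extra srun, scontrol, and nvidia-smi lines are purely diagnostic, to compare what the job was actually allocated against what the batch shell and a job step each see:

#!/bin/bash
#SBATCH -N 1
#SBATCH -A ops
#SBATCH -t 10
#SBATCH -p dlt
#SBATCH --gres=gpu:1
#SBATCH -w dlt02

# What the batch shell itself sees
env | grep CUDA
# What a job step launched with srun sees
srun env | grep CUDA
# What the scheduler actually allocated to this job
scontrol -d show job $SLURM_JOB_ID | grep -i gres
# Which devices are visible at the driver level on the node
nvidia-smi -L

If the srun step reports a single device while the batch shell reports all eight, that would point at the batch step environment rather than the allocation itself.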