Hello,

we are running Slurm 18.08.6 and have problems with GRES GPU management. There is a "gpu" partition with 12 nodes, each with 4 Tesla V100 cards. Allocation of the GPUs works, and GPU management for sbatch/srun jobs works too: CUDA_VISIBLE_DEVICES is set correctly according to the --gres=gpu:x option. But we have problems with GPU management for job steps. If I try this example:

#!/bin/bash
#
# gres_test.bash
# Submit as follows:
# sbatch -p gpu --gres=gpu:4 -n4 gres_test.bash
#
echo JOB $SLURM_JOB_ID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
wait

cat show_devices.sh
#!/bin/bash
echo JOB $SLURM_JOB_ID STEP $SLURM_STEP_ID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES

I get:

JOB 49614 CUDA_VISIBLE_DEVICES=0,1,2,3
JOB 49614 STEP 0 CUDA_VISIBLE_DEVICES=0
JOB 49614 STEP 1 CUDA_VISIBLE_DEVICES=0
JOB 49614 STEP 2 CUDA_VISIBLE_DEVICES=0
JOB 49614 STEP 3 CUDA_VISIBLE_DEVICES=0

But according to https://slurm.schedmd.com/gres.html I expect:

JOB 49614 CUDA_VISIBLE_DEVICES=0,1,2,3
JOB 49614 STEP 0 CUDA_VISIBLE_DEVICES=0
JOB 49614 STEP 1 CUDA_VISIBLE_DEVICES=1
JOB 49614 STEP 2 CUDA_VISIBLE_DEVICES=2
JOB 49614 STEP 3 CUDA_VISIBLE_DEVICES=3

So we are not able to distribute job steps across different GPUs inside an sbatch job. We can use a wrapper like this:

#!/bin/bash
export CUDA_VISIBLE_DEVICES=$SLURM_STEPID
my_job

but a built-in Slurm solution would be better and more robust.

GRES section of slurm.conf:

AccountingStorageTRES=gres/gpu
JobAcctGatherType=jobacct_gather/cgroup
GresTypes=gpu
NodeName=n[21-32] Gres=gpu:v100:4 Sockets=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=384000 TmpDisk=150000 State=UNKNOWN Weight=1000
PartitionName=gpu Nodes=n[21-32] Default=NO MaxTime=24:00:00 State=UP Priority=5 PriorityTier=15 OverSubscribe=FORCE

/etc/slurm/gres.conf:

Name=gpu Type=v100 File=/dev/nvidia0 CPUs=0-17,36-53
Name=gpu Type=v100 File=/dev/nvidia1 CPUs=0-17,36-53
Name=gpu Type=v100 File=/dev/nvidia2 CPUs=18-35,54-71
Name=gpu Type=v100 File=/dev/nvidia3 CPUs=18-35,54-71

Any help appreciated.

Thanks,
Daniel Vecerka
CTU Prague
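
As a footnote to the wrapper workaround mentioned above: the one-line version assumes the GPU indices are exactly the step IDs (0, 1, 2, 3). A slightly more general sketch could pick the entry from the job's own device list instead. The script name gpu_wrapper.sh, the helper pick_gpu and the variable JOB_GPUS are hypothetical names used only for illustration; JOB_GPUS would have to be exported by the batch script itself. This only sets an environment variable and does nothing to enforce the assignment, so it is a stopgap rather than a replacement for proper per-step GRES handling.

```shell
#!/bin/bash
# gpu_wrapper.sh (hypothetical name): choose one GPU per job step.

# pick_gpu LIST INDEX
# Print the entry of the comma-separated LIST at position
# INDEX modulo the number of entries (positions start at 0).
pick_gpu() {
    # Count the entries by turning commas into newlines.
    pg_count=$(printf '%s\n' "$1" | tr ',' '\n' | wc -l)
    # cut numbers fields from 1, hence the +1.
    pg_field=$(( $2 % pg_count + 1 ))
    printf '%s\n' "$1" | cut -d',' -f"$pg_field"
}

# JOB_GPUS is an assumed variable holding the job-level device list,
# e.g. "0,1,2,3"; SLURM_STEPID is set by srun inside a job step.
if [ -n "${JOB_GPUS:-}" ] && [ -n "${SLURM_STEPID:-}" ]; then
    CUDA_VISIBLE_DEVICES=$(pick_gpu "$JOB_GPUS" "$SLURM_STEPID")
    export CUDA_VISIBLE_DEVICES
fi

# Run the real program given on the command line, if any.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

In the batch script this would be used by adding "export JOB_GPUS=$CUDA_VISIBLE_DEVICES" before the srun lines and starting each step as "srun -n1 --exclusive gpu_wrapper.sh my_job &". The modulo keeps the index valid if there are more steps than GPUs, at the cost of several steps sharing a device.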