Hi everyone,

Your help with this would be much appreciated.

We have a cluster with 3 types of GPU configured in gres. Users can successfully request 2 of the GPU types, but the third errors when requested.

Here is the successful salloc behaviour:

root@boole01:/etc/slurm # salloc --gres=gpu:tesla:1 -N 1
salloc: Granted job allocation 271558
[root@boole-n019:/etc/slurm]# exit
salloc: Relinquishing job allocation 271558
root@boole01:/etc/slurm # salloc --gres=gpu:volta:1 -N 1
salloc: Pending job allocation 271559
salloc: job 271559 queued and waiting for resources
^Csalloc: Job allocation 271559 has been revoked.
salloc: Job aborted due to signal

And the unsuccessful salloc behaviour:

root@boole01:/etc/slurm # salloc --gres=gpu:2080ti:1 -N 1
salloc: error: Job submit/allocate failed: Invalid generic resource (gres) specification

Slurm.log output for the successful salloc calls:

[2019-01-11T10:13:36.434] sched: _slurm_rpc_allocate_resources JobId=271558 NodeList=boole-n019 usec=30495
[2019-01-11T10:13:42.485] _job_complete: JobId=271558 WEXITSTATUS 0
[2019-01-11T10:13:42.486] _job_complete: JobId=271558 done
[2019-01-11T10:13:46.000] sched: _slurm_rpc_allocate_resources JobId=271559 NodeList=(null) usec=15674
[2019-01-11T10:13:48.778] _job_complete: JobId=271559 WTERMSIG 126
[2019-01-11T10:13:48.778] _job_complete: JobId=271559 cancelled by interactive user
[2019-01-11T10:13:48.778] _job_complete: JobId=271559 done

Slurm.log output for the unsuccessful salloc call:

[2019-01-11T10:13:55.755] _get_next_job_gres: Invalid GRES job specification gpu:2080ti:1
[2019-01-11T10:13:55.755] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification

Slurm gres configuration:

root@boole01:/etc/slurm # grep -i gres slurm.conf | grep -v ^#
GresTypes=gpu,mic
NodeName=boole-n[018-023] Gres=gpu:tesla:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=50
NodeName=boole-n024 Gres=gpu:2080ti:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=100
NodeName=boole-n016 Gres=gpu:volta:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=200

gres.conf:

root@boole01:/etc/slurm # cat gres.conf
NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia0
NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia1
NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia0
NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia1
NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia0
NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia1
#NodeName=boole-n017 Name=mic File=/dev/mic0
#NodeName=boole-n017 Name=mic File=/dev/mic1

Please let me know if there is any more info that would be helpful. What am I missing or doing wrong?

Many thanks in advance.

Sean

--
Sean McGrath M.Sc
Systems Administrator
Trinity Centre for High Performance and Research Computing
Trinity College Dublin
sean.mcgr...@tchpc.tcd.ie
https://www.tcd.ie/
https://www.tchpc.tcd.ie/
+353 (0) 1 896 3725