Hi everyone,

Any help with this would be much appreciated.

We have a cluster with three GPU types configured as gres. Users can
successfully request two of the types, but requesting the third fails with an
error.

Here is the salloc behaviour for the two types that work (the volta request is
accepted and queues for resources):

root@boole01:/etc/slurm # salloc --gres=gpu:tesla:1 -N 1
salloc: Granted job allocation 271558
[root@boole-n019:/etc/slurm]# exit
salloc: Relinquishing job allocation 271558
root@boole01:/etc/slurm # salloc --gres=gpu:volta:1 -N 1
salloc: Pending job allocation 271559
salloc: job 271559 queued and waiting for resources
^Csalloc: Job allocation 271559 has been revoked.
salloc: Job aborted due to signal

And the unsuccessful salloc behaviour:

root@boole01:/etc/slurm # salloc --gres=gpu:2080ti:1 -N 1
salloc: error: Job submit/allocate failed: Invalid generic resource (gres)

Slurm.log output for the successful salloc requests:

[2019-01-11T10:13:36.434] sched: _slurm_rpc_allocate_resources JobId=271558 NodeList=boole-n019 usec=30495
[2019-01-11T10:13:42.485] _job_complete: JobId=271558 WEXITSTATUS 0
[2019-01-11T10:13:42.486] _job_complete: JobId=271558 done
[2019-01-11T10:13:46.000] sched: _slurm_rpc_allocate_resources JobId=271559 NodeList=(null) usec=15674
[2019-01-11T10:13:48.778] _job_complete: JobId=271559 WTERMSIG 126
[2019-01-11T10:13:48.778] _job_complete: JobId=271559 cancelled by interactive
[2019-01-11T10:13:48.778] _job_complete: JobId=271559 done

Slurm.log output for the unsuccessful salloc request:

[2019-01-11T10:13:55.755] _get_next_job_gres: Invalid GRES job specification
[2019-01-11T10:13:55.755] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification

Slurm gres configuration:

root@boole01:/etc/slurm # grep -i gres slurm.conf | grep -v ^#
NodeName=boole-n[018-023] Gres=gpu:tesla:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=50
NodeName=boole-n024 Gres=gpu:2080ti:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=100
NodeName=boole-n016 Gres=gpu:volta:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=200


root@boole01:/etc/slurm # cat gres.conf
NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia0
NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia1
NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia0
NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia1
NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia0
NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia1
#NodeName=boole-n017 Name=mic File=/dev/mic0
#NodeName=boole-n017 Name=mic File=/dev/mic1
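
In case it helps rule out a simple mismatch between the two files, here is a
minimal Python sketch that cross-checks the excerpts above. It assumes every
Gres= entry has the form name:type:count and compares NodeName strings
literally, without expanding host ranges. The helper names (fields, declared,
defined, mismatches) are my own and are not part of Slurm.

```python
from collections import Counter

# Excerpts copied from the configuration shown above (wrapped lines rejoined).
SLURM_CONF = """\
NodeName=boole-n[018-023] Gres=gpu:tesla:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=50
NodeName=boole-n024 Gres=gpu:2080ti:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=100
NodeName=boole-n016 Gres=gpu:volta:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=200
"""

GRES_CONF = """\
NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia0
NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia1
NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia0
NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia1
NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia0
NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia1
#NodeName=boole-n017 Name=mic File=/dev/mic0
#NodeName=boole-n017 Name=mic File=/dev/mic1
"""


def fields(line):
    """Split a 'Key=Value Key=Value' line into a dict."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


def declared(slurm_conf):
    """Map (node, name, type) to the count given by Gres=name:type:count."""
    counts = Counter()
    for line in slurm_conf.splitlines():
        if line.lstrip().startswith("#") or "Gres=" not in line:
            continue
        f = fields(line)
        name, gres_type, count = f["Gres"].split(":")
        counts[(f["NodeName"], name, gres_type)] += int(count)
    return counts


def defined(gres_conf):
    """Map (node, name, type) to the number of gres.conf lines defining it."""
    counts = Counter()
    for line in gres_conf.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        f = fields(line)
        counts[(f["NodeName"], f["Name"], f["Type"])] += 1
    return counts


def mismatches(slurm_conf, gres_conf):
    """Return {key: (slurm.conf count, gres.conf count)} where they differ."""
    a, b = declared(slurm_conf), defined(gres_conf)
    return {k: (a[k], b[k]) for k in set(a) | set(b) if a[k] != b[k]}


print(mismatches(SLURM_CONF, GRES_CONF))
```

This prints an empty dict for the configuration above, so the per-node GPU
counts agree between slurm.conf and gres.conf, and a plain count mismatch does
not seem to explain the error.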

Please let me know if any more information would be helpful.

What am I missing or doing wrong?

Many thanks in advance.


Sean McGrath M.Sc

Systems Administrator
Trinity Centre for High Performance and Research Computing
Trinity College Dublin



+353 (0) 1 896 3725
