[slurm-users] What happens if GPU GRES exceeding number of GPUs per node

Purwanto, Wirawan Wed, 17 Jan 2024 07:55:48 -0800

Hi,

At my HPC center, I found a SLURM job that was submitted with --gres=gpu:6,
even though each node in the cluster has only four GPUs. It is a parallel job.
Here is a printout of some relevant fields:


AllocCPUS                                      30
AllocGRES                                   gpu:6
AllocTRES     billing=30,cpu=30,gres/gpu=6,node=3
CPUTime                                1-01:23:00
CPUTimeRAW                                  91380
Elapsed                                  00:50:46
JobID                                       20073
JobIDRaw                                    20073
JobName                               simple_cuda
NCPUS                                          30
NGPUS                                         6.0

What happened in this case? The job asked for 3 nodes with 10 cores per node.
When the user specified “--gres=gpu:6”, does this mean six GPUs for the entire
job, or six GPUs per node? The description at
https://slurm.schedmd.com/gres.html#Running_Jobs says that gres is “Generic
resources required per node”. By that reading, requesting six GPUs per node
makes no sense on nodes that have only four.
So what happened? Did SLURM quietly ignore the request and grant just one GPU,
or grant the maximum number (4)? The job apparently ran without error.
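
To make the two readings concrete, here is a minimal Python sketch of the arithmetic implied by the AllocTRES field above. It only parses the string from the printout and does not query SLURM. The helper name parse_tres is hypothetical, and the sketch assumes every TRES value is a plain integer, which holds for this record but not for values with unit suffixes. It cannot show what SLURM actually granted; it only shows what each reading would imply.

```python
def parse_tres(tres: str) -> dict[str, int]:
    """Split a comma-separated TRES string into a name -> count mapping.

    Hypothetical helper for illustration; assumes plain integer values.
    """
    fields = {}
    for item in tres.split(","):
        key, _, value = item.partition("=")
        fields[key.strip()] = int(value)
    return fields


# AllocTRES exactly as it appears in the printout above.
alloc = parse_tres("billing=30,cpu=30,gres/gpu=6,node=3")

recorded_gpus = alloc["gres/gpu"]  # 6
nodes = alloc["node"]              # 3
gpus_per_node_available = 4        # stated in the post, not in the record

# Reading 1: the recorded value is a total for the whole job.
# An even split across nodes is an assumption, not something the record shows.
per_node_if_job_total = recorded_gpus / nodes

# Reading 2: the recorded value is a per-node count.
job_total_if_per_node = recorded_gpus * nodes

# Reading 2 would need more GPUs on each node than the hardware has.
reading_2_fits_hardware = recorded_gpus <= gpus_per_node_available

print(per_node_if_job_total, job_total_if_per_node, reading_2_fits_hardware)
```

Under the first reading the job would average two GPUs per node, which fits on four-GPU nodes. Under the second it would need 18 GPUs in total, with six on each node.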

Wirawan Purwanto
Computational Scientist, HPC Group
Information Technology Services
Old Dominion University
Norfolk, VA 23529
