
I just added a third node (called "hsw5") to my Slurm partition, as we
continue to enable Slurm in our environment.  But the new node is not
accepting jobs that require a GPU, even though it has 3 GPUs.

The other node that has a GPU ("devops3") is accepting GPU jobs as
expected.  A colleague pointed out an interesting difference (under the
GRES column) when we ran this command:

(! 676)-> sinfo -o "%20N  %10c  %10m  %25f  %20G "
devops2               4           9913        avx,centos,fast,fma,fma4,
devops3               8           40213       centos,cuda10.1p,cuda10.2
hsw5                  64          257847      foo,bar
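
In case it helps to isolate that column, here is a rough sketch of how the
GRES value per node could be checked.  The helper name
nodes_without_gpu_gres is made up for illustration, and I am assuming that
sinfo prints something like "(null)" for a node with no GRES configured:

```shell
# Hypothetical helper (not part of Slurm): reads "NODE GRES" pairs on stdin
# and prints the name of every node whose GRES field does not mention "gpu".
nodes_without_gpu_gres() {
    awk 'NF && $2 !~ /gpu/ { print $1 }'
}

# Intended usage on a host with the Slurm client tools installed:
#   sinfo -h -N -o "%N %G" | nodes_without_gpu_gres
```

The Gres= field in the output of "scontrol show node hsw5" should show the
same information for a single node.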

Is there a problem with the GPU bindings on "hsw5"?  Do GPUs need to be
associated with sockets, or something like that?

Here is the error message I'm seeing:

(! 681)-> /opt/slurm-20.11.5/bin/sbatch --export=NONE -N 1 --constraint foo --gpus=1 --wrap "ls"
sbatch: error: Batch job submission failed: Requested node configuration is not available

(! 682)-> /opt/slurm-20.11.5/bin/sbatch --export=NONE -N 1 --constraint foo --wrap "ls"
Submitted batch job 385

Thanks for the help,

