Dear Danny Auble, developers and users,

After updating SLURM from version 17.02.2 to 17.11.6, the behavior of the
gres plugin has changed.

With version 17.02.2, the following gres.conf worked:
Name=gpu Type=K40 File=/dev/nvidia0   COREs=0
Name=gpu Type=K40 File=/dev/nvidia1   COREs=10
Name=gpu Type=cpu                     COREs=2-9,12-19 Count=16
Name=gpu Type=debugcpu                COREs=1,11      Count=2

All GPU jobs started successfully on slurm-17.02.2.
But slurm-17.11.6 no longer sets the CUDA_VISIBLE_DEVICES variable, and all
jobs on the same node use only one GPU. This is caused by an error raised in
the function common_gres_set_env(...) in the file
src/plugins/gres/common/gres_common.c:196
because at that point:
len = bit_size(bit_alloc); // equal to 20
list_count(gres_devices) is equal to 2
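
To show where the two numbers could come from, here is a minimal Python sketch that recomputes them from the gres.conf above. It is not SLURM code: the helper names parse_gres_line and summarize are made up for illustration, and it assumes that the allocation bitmap has one bit per configured GRES (a File= line without Count= counting as one) while the device list has one entry per File= line.

```python
def parse_gres_line(line):
    """Split one gres.conf line into a dict with lower-cased keys."""
    return {k.lower(): v for k, v in (tok.split("=", 1) for tok in line.split())}


def summarize(conf_text):
    """Return (total GRES count, number of lines that name a device file)."""
    total = 0
    device_files = 0
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = parse_gres_line(line)
        has_file = "file" in fields
        # Assumption: a line with File= and no Count= contributes one GRES.
        total += int(fields.get("count", 1 if has_file else 0))
        if has_file:
            device_files += 1
    return total, device_files


GRES_CONF = """\
Name=gpu Type=K40 File=/dev/nvidia0   COREs=0
Name=gpu Type=K40 File=/dev/nvidia1   COREs=10
Name=gpu Type=cpu                     COREs=2-9,12-19 Count=16
Name=gpu Type=debugcpu                COREs=1,11      Count=2
"""

total, device_files = summarize(GRES_CONF)
print(total, device_files)  # prints "20 2"
```

Under these assumptions the total is 1 + 1 + 16 + 2 = 20, which matches bit_size(bit_alloc), while only the two K40 lines name a device file, which matches list_count(gres_devices).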

Why does this gres.conf no longer work? I use it to make sure that
cores 0 and 10 are used only by GPU jobs and never by tasks without a GPU.

The bug may be in:
src/plugins/gres/common/gres_common.c
src/common/gres.c

I do not think I can fix this problem myself.
Does anyone know a solution?

Best regards, Vova.
