Dear Danny Auble, developers and users,

After updating SLURM from version 17.02.2 to 17.11.6, the behavior of the gres plugin has changed.
With version 17.02.2, gres.conf could be:

    Name=gpu Type=K40 File=/dev/nvidia0 COREs=0
    Name=gpu Type=K40 File=/dev/nvidia1 COREs=10
    Name=gpu Type=cpu COREs=2-9,12-19 Count=16
    Name=gpu Type=debugcpu COREs=1,11 Count=2

All GPU jobs start successfully on slurm-17.02.2. But slurm-17.11.6 does not set the CUDA_VISIBLE_DEVICES variable, and all jobs on the same node use only one GPU.

This is due to an error generated in the function common_gres_set_env(...) in src/plugins/gres/common/gres_common.c:196, where, in my case:

    len = bit_size(bit_alloc);     // equal to 20
    list_count(gres_devices)       // equal to 2

Why does this gres.conf no longer work? I use it to make sure that cores 0 and 10 are used only with a GPU and never for tasks without a GPU.

In general, the bug could be in:

    src/plugins/gres/common/gres_common.c
    src/common/gres.c

I do not think I can fix this problem by myself. Does anyone know a solution?

Best regards,
Vova.
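
P.S. To show where the two numbers might come from, here is a small illustration in Python. It is not Slurm code and does not call any Slurm function. It rests on my assumption (not confirmed in the source) that the bitmap has one bit per configured gres of name "gpu", while the device list has one entry per line with File=. The helper names parse_line and summarize are made up for this example.

```python
# Illustration only: this is not Slurm code. It just redoes the arithmetic
# on the gres.conf from this message to show where 20 and 2 could come from.

GRES_CONF = """\
Name=gpu Type=K40 File=/dev/nvidia0 COREs=0
Name=gpu Type=K40 File=/dev/nvidia1 COREs=10
Name=gpu Type=cpu COREs=2-9,12-19 Count=16
Name=gpu Type=debugcpu COREs=1,11 Count=2
"""


def parse_line(line):
    """Split one gres.conf line into a dict of Key=Value pairs."""
    return dict(field.split("=", 1) for field in line.split())


def summarize(conf_text):
    """Return (total gres count, number of entries that name a device file)."""
    total = 0
    with_file = 0
    for line in conf_text.splitlines():
        if not line.strip():
            continue
        entry = parse_line(line)
        # Assumption: a line with a single File= and no Count= counts as 1.
        total += int(entry.get("Count", 1))
        if "File" in entry:
            with_file += 1
    return total, with_file


total, with_file = summarize(GRES_CONF)
print(total, with_file)  # prints "20 2"
```

If this assumption is right, 1 + 1 + 16 + 2 = 20 matches bit_size(bit_alloc), and the two File= lines match list_count(gres_devices).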