Hello,

I am trying to rewrite my gres.conf file.

Before the changes, the file looked like this:
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1 Cores=12-23
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1 Cores=12-23
NodeName=node-gpu-3 AutoDetect=off Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-4 AutoDetect=off Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0 Cores=0-7
# As you can see, node-gpu-1 and node-gpu-2 have two GPUs each, whereas node-gpu-3 and node-gpu-4 have only one GPU each.


And my slurmd.conf was this:
[...]
AccountingStorageTRES=gres/gpu
GresTypes=gpu
NodeName=node-gpu-1 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1
NodeName=node-gpu-2 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1
NodeName=node-gpu-3 CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=23000 Gres=gpu:GeForceRTX3080:1
NodeName=node-gpu-4 CPUs=8 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7800 Gres=gpu:GeForceRTX3080:1
NodeName=node-worker-[0-22] CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=47000
[...]

With this configuration, everything seems to work fine, except that slurmctld.log reports:
[...]
error: _node_config_validate: gres/gpu: invalid GRES core specification (0-11) on node node-gpu-3
error: _node_config_validate: gres/gpu: invalid GRES core specification (12-23) on node node-gpu-1
error: _node_config_validate: gres/gpu: invalid GRES core specification (12-23) on node node-gpu-2
error: _node_config_validate: gres/gpu: invalid GRES core specification (0-7) on node node-gpu-4
[...]
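
One possible reading of these messages, which I have not confirmed: if Cores= in gres.conf counts physical cores (SocketsPerBoard x CoresPerSocket) rather than hardware threads, then some of my ranges run past the last core. The small Python sketch below only does that arithmetic with the values from my two files; the function name invalid_core_specs and the "threads are not counted" rule are my own assumptions, not something taken from the Slurm sources.

```python
# Node topology copied from the NodeName lines: (sockets, cores per socket).
NODES = {
    "node-gpu-1": (2, 6),
    "node-gpu-2": (2, 6),
    "node-gpu-3": (1, 6),
    "node-gpu-4": (1, 4),
}

# Cores= ranges copied from the original gres.conf, as (node, first, last).
GRES_CORES = [
    ("node-gpu-1", 0, 11),
    ("node-gpu-1", 12, 23),
    ("node-gpu-2", 0, 11),
    ("node-gpu-2", 12, 23),
    ("node-gpu-3", 0, 11),
    ("node-gpu-4", 0, 7),
]


def invalid_core_specs(nodes, gres_cores):
    """Return (node, "first-last") for every range that exceeds the core count."""
    bad = []
    for node, first, last in gres_cores:
        sockets, cores_per_socket = nodes[node]
        # Assumption: only physical cores count, so ThreadsPerCore is ignored.
        physical_cores = sockets * cores_per_socket
        if last >= physical_cores:
            bad.append((node, f"{first}-{last}"))
    return bad


if __name__ == "__main__":
    for node, spec in invalid_core_specs(NODES, GRES_CORES):
        print(f"{node}: Cores={spec} goes past the last physical core")
```

Under that assumption, the ranges flagged by the sketch are the same four that appear in the log, while Cores=0-11 on node-gpu-1 and node-gpu-2 passes.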

However, despite these errors, users can submit jobs and request GPU resources.



Now, I have tried to reconfigure gres.conf and slurmd.conf in this way:
gres.conf:
Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0
Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1
Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0
Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1
Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
# there is no NodeName attribute

slurmd.conf:
[...]
NodeName=node-gpu-1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1
NodeName=node-gpu-2 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1
NodeName=node-gpu-3 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=23000 Gres=gpu:GeForceRTX3080:1
NodeName=node-gpu-4 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7800 Gres=gpu:GeForceRTX3080:1
NodeName=node-worker-[0-22] SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=47000
# there is no CPUs attribute
[...]


With this new configuration, the slurmd.service daemon starts correctly on the
nodes with a GPU, but on the nodes without a GPU (node-worker-[0-22]) it fails
to start and returns this error:
[...]
error: Waiting for gres.conf file /dev/nvidia0
fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory
[...]

It seems that Slurm expects the node-worker nodes to have an NVIDIA GPU as
well, but they do not: these nodes have no GPU. So, where is my configuration error?
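
To make my guess concrete, here is a small Python model of it. It assumes that a gres.conf line without NodeName is applied by every node that reads the file, and that a line with NodeName is applied only by the named node; the helper entries_for_node and the TAGGED_GRES list are hypothetical and only exist for this illustration.

```python
# The six entries of my new gres.conf, in file order; none has a NodeName.
NEW_GRES = [
    {"Type": "GeForceRTX2070", "File": "/dev/nvidia0"},
    {"Type": "GeForceGTX1080Ti", "File": "/dev/nvidia1"},
    {"Type": "GeForceGTX1080Ti", "File": "/dev/nvidia0"},
    {"Type": "GeForceGTX1080", "File": "/dev/nvidia1"},
    {"Type": "GeForceRTX3080", "File": "/dev/nvidia0"},
    {"Type": "GeForceRTX3080", "File": "/dev/nvidia0"},
]

# Owner of each entry in the original file, in the same order.
OWNERS = [
    "node-gpu-1", "node-gpu-1",
    "node-gpu-2", "node-gpu-2",
    "node-gpu-3", "node-gpu-4",
]

# Hypothetical variant: the same entries with NodeName put back.
TAGGED_GRES = [dict(entry, NodeName=owner) for entry, owner in zip(NEW_GRES, OWNERS)]


def entries_for_node(entries, node):
    """Entries a node would apply: untagged ones, plus those naming the node."""
    return [e for e in entries if e.get("NodeName") in (None, node)]


if __name__ == "__main__":
    worker = "node-worker-0"
    print(len(entries_for_node(NEW_GRES, worker)), "entries apply without NodeName")
    print(len(entries_for_node(TAGGED_GRES, worker)), "entries apply with NodeName")
```

If the assumption holds, a worker node picks up every File=/dev/nvidia* entry from the untagged file, which would explain why it waits for /dev/nvidia0, and picks up none of them once NodeName is present.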

I have read the syntax and examples at https://slurm.schedmd.com/gres.conf.html,
but it seems I'm doing something wrong.

Thanks!!