Hello all. My team is enabling Slurm (version 20.11.5) in our environment, and we have a controller up and running along with two nodes. Everything was working fine; however, when we tried to enable configless mode, I ran into a problem. The node that has a GPU comes up in "drained" state, and sinfo -Nl shows the following:
(dhenkemeyer)-(devops1)-(x86_64-redhat-linux-gnu)-(~/slurm/bin) (! 726)-> sinfo -Nl
Fri May 07 10:20:20 2021
NODELIST NODES PARTITION STATE   CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
devops2  1     debug*    idle    4    1:4:1 9913   0        1      avx,cent none
devops3  1     debug*    drained 8    2:4:1 40213  0        1      foo,bar  gres/gpu count repor

As you can see, it appears to be related to the gres/gpu count. Here is the entry for the node in the slurm.conf file (attached) on the controller:

NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 Features=foo,bar Gres=gpu:kepler:1

Prior to this, we also tried a simpler way of expressing Gres:

NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 Features=foo,bar Gres=gpu:1

But that also failed. I am logged in on the controller, and have enabled debug output when launching slurmd on the nodes. On the problematic node (the one with the GPU), I am seeing this message repeating:

slurmd: debug: Unable to register with slurm controller, retrying

and on the controller, I am seeing this message repeating:

[2021-05-07T10:23:30.417] error: _slurm_rpc_node_registration node=devops3: Invalid argument

So the two are definitely related. Any help would be appreciated.

I tried moving the slurm.conf file from the GPU node to the controller, but that caused our non-GPU node to puke on startup:

slurmd: fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.
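In case it helps, my understanding is that the Gres= line in slurm.conf is normally paired with a gres.conf on the node, in one of two forms. The device path below and the AutoDetect alternative are assumptions on my part, not our verified config:

# gres.conf on devops3 -- explicit form; the File= path is an assumption
NodeName=devops3 Name=gpu Type=kepler File=/dev/nvidia0

# or the autodetect form, which would explain the nvml fatal above
# on a node whose Slurm build could not find the NVML library:
AutoDetect=nvml

The drain reason ("gres/gpu count repor...") suggests slurmd on devops3 is reporting a different GPU count than slurm.conf declares, so whichever form is actually in effect on that node is what I plan to double-check.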
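For completeness, this is roughly how we are launching slurmd on the nodes; the exact verbosity flags and the controller hostname (devops1) are illustrative rather than our exact command line:

# run slurmd in the foreground with verbose debug logging;
# --conf-server points it at the controller for configless mode
slurmd -D -vv --conf-server devops1

My understanding of configless mode is that slurmd pulls slurm.conf from the controller at registration, which is why I would expect the controller's copy, not the GPU node's, to be the authoritative one.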
[Attachment: slurm.conf]