Thank you for the reply, Will! The gres.conf file only has one line in it:

AutoDetect=nvml

During my debug, I copied this file from the GPU node to the controller. But
that's when I noticed that the node w/o a GPU then crashed on startup.

David
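With a bare AutoDetect=nvml served configlessly, every node that picks up the
file will try to load the NVML library, which would explain the crash on the
GPU-less node. A minimal sketch of a gres.conf that sidesteps autodetect by
describing the GPU explicitly for the one node that has it (the device path
/dev/nvidia0 is an assumption; check "ls /dev/nvidia*" on devops3):

    # gres.conf on the controller -- no global AutoDetect, so nodes
    # without a GPU never try to load the NVML library
    NodeName=devops3 Name=gpu Type=kepler File=/dev/nvidia0   # hypothetical device path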
On Fri, May 7, 2021 at 12:14 PM Will Dennis <wden...@nec-labs.com> wrote:

> Hi David,
>
> What is the gres.conf in the controller's /etc/slurm? Is it autodetect
> via nvml?
>
> In configless mode, the slurm.conf, gres.conf, etc. are maintained only on
> the controller, and the worker nodes get them from there automatically
> (you don't want those files on the worker nodes). If you need to see what
> the slurmd daemon is seeing/doing in real time, start slurmd on the node
> via "slurmd -Dvvvv" and you will see the log messages on stdout. (If it
> normally runs via systemd, run "systemctl stop slurmd" first.)
>
> Regards,
> Will
>
> ------------------------------
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> David Henkemeyer <david.henkeme...@gmail.com>
> *Sent:* Friday, May 7, 2021 2:41:41 PM
> *To:* slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
> *Subject:* [slurm-users] Configless mode enabling issue
>
> Hello all. My team is enabling Slurm (version 20.11.5) in our environment,
> and we got a controller up and running, along with 2 nodes. Everything was
> working fine. However, when we tried to enable configless mode, I ran into
> a problem. The node that has a GPU is coming up in "drained" state, and
> sinfo -Nl shows the following:
>
> (dhenkemeyer)-(devops1)-(x86_64-redhat-linux-gnu)-(~/slurm/bin)
> (! 726)-> sinfo -Nl
> Fri May 07 10:20:20 2021
> NODELIST NODES PARTITION STATE   CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> devops2  1     debug*    idle    4    1:4:1 9913   0        1      avx,cent none
> devops3  1     debug*    drained 8    2:4:1 40213  0        1      foo,bar  gres/gpu count repor
>
> As you can see, it appears to be related to the gres/gpu count. Here is
> the entry for the node in the slurm.conf file (which is attached) on the
> controller:
>
> NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 Features=foo,bar Gres=gpu:kepler:1
>
> Prior to this, we also tried a simpler way of expressing Gres:
>
> NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 Features=foo,bar Gres=gpu:1
>
> But that also failed. I am watching the log on the controller, and have
> enabled debug output when I launch slurmd on the nodes. On the problematic
> node (the one with the GPU), I am seeing this repeating message:
>
> slurmd: debug: Unable to register with slurm controller, retrying
>
> and on the controller, I am seeing this repeating message:
>
> [2021-05-07T10:23:30.417] error: _slurm_rpc_node_registration node=devops3: Invalid argument
>
> So they are definitely related. Any help would be appreciated. I tried
> moving the gres.conf file from the GPU node to the controller, but that
> caused our non-GPU node to puke on startup:
>
> slurmd: fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.
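Will's foreground-debugging suggestion, spelled out as the concrete commands
(run as root on the affected node; the -G flag, which prints the GRES
configuration slurmd detects and then exits, is a useful companion here):

    systemctl stop slurmd   # stop the service first if slurmd runs under systemd
    slurmd -Dvvvv           # run slurmd in the foreground; verbose logs go to stdout
    slurmd -G               # print the GRES (e.g. gpu) config slurmd detects, then exit

Comparing the slurmd -G output against the Gres= count in the controller's
slurm.conf is one way to pin down a "gres/gpu count reported" mismatch.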
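Once the controller's Gres= line and what slurmd reports agree, the node will
likely still need its drain cleared by hand; something along these lines
(standard scontrol usage):

    scontrol show node devops3                      # Reason= shows the full gres/gpu count message
    scontrol update NodeName=devops3 State=RESUME   # clear the drain once the counts match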