The GPU nodes shouldn't have any config files locally; with configless they pull everything 
from the controller (i.e. all config files are centralized there.)
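
For reference, a minimal configless setup looks something like this (I'm assuming devops1 is 
your controller here, and 6817 is just the default slurmctld port):

# slurm.conf on the controller
SlurmctldParameters=enable_configless

# on each worker node, point slurmd at the controller instead of a local slurm.conf
# (e.g. via SLURMD_OPTIONS in /etc/sysconfig/slurmd or the systemd unit)
slurmd --conf-server devops1:6817

The fetched files (slurm.conf, gres.conf, etc.) get cached under the slurmd spool dir and are 
refreshed on "scontrol reconfigure", as far as I recall.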

Now, did you build Slurm on the GPU nodes yourself, or install via package manager? If the 
latter, do you know whether it was compiled/packaged on a machine that had the NVIDIA 
libraries present? (If the build couldn't find the NVIDIA libs when Slurm was configured, 
NVML support wouldn't have been built in...)
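
One quick way to check on the GPU node (the plugin dir path varies by distro/build; 
/usr/lib64/slurm is just a guess for a RedHat-style install):

# if NVML support was built, the gpu_nvml plugin should exist...
ls /usr/lib64/slurm/gpu_nvml.so
# ...and it should be linked against the NVIDIA management library
ldd /usr/lib64/slurm/gpu_nvml.so | grep -i nvidia-ml

If that plugin isn't there, AutoDetect=nvml has nothing to load, regardless of what the 
config says.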


________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of David 
Henkemeyer <david.henkeme...@gmail.com>
Sent: Friday, May 7, 2021 8:31:16 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Configless mode enabling issue

Thank you for the reply, Will!

The slurm.conf file only has one line in it:

AutoDetect=nvml

During my debug, I copied this file from the GPU node to the controller.  But, 
that's when I noticed that the node w/o a GPU then crashed on startup.

David

On Fri, May 7, 2021 at 12:14 PM Will Dennis <wden...@nec-labs.com> wrote:
Hi David,

What is in the gres.conf in the controller's /etc/slurm? Is it autodetect via NVML?

In configless mode the slurm.conf, gres.conf, etc. are maintained only on the controller, and 
the worker nodes fetch them from there automatically (you don't want those files on the 
worker nodes.) If you need to see what the slurmd daemon is seeing/doing in real time, start 
slurmd on the node with “slurmd -Dvvvv” and you will see the log messages on stdout. (If it 
normally runs via systemd, do “systemctl stop slurmd” 1st.)
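
i.e., something like this on the node:

systemctl stop slurmd      # if it normally runs under systemd
slurmd -Dvvvv              # foreground, very verbose, logs to stdout (Ctrl-C to stop)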

Regards,
Will


________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of David Henkemeyer 
<david.henkeme...@gmail.com>
Sent: Friday, May 7, 2021 2:41:41 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Configless mode enabling issue

Hello all. My team is enabling Slurm (version 20.11.5) in our environment, and we got a 
controller up and running, along with 2 nodes.  Everything was working fine.  However, when 
we tried to enable configless mode, I ran into a problem.  The node that has a GPU is coming 
up in "drained" state, and sinfo -Nl shows the following:

(dhenkemeyer)-(devops1)-(x86_64-redhat-linux-gnu)-(~/slurm/bin)
(! 726)-> sinfo -Nl
Fri May 07 10:20:20 2021
NODELIST   NODES PARTITION       STATE CPUS    S:c:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
devops2        1    debug*        idle 4       1:4:1   9913        0      1 avx,cent none
devops3        1    debug*     drained 8       2:4:1  40213        0      1  foo,bar gres/gpu count repor

As you can see, it appears to be related to the gres/gpu count.  Here is the 
entry for the node, in the slurm.conf file (which is attached) on the 
controller:

NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 
Features=foo,bar Gres=gpu:kepler:1

Prior to this, we also tried a simpler way of expressing Gres:

NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 
Features=foo,bar Gres=gpu:1

But that also failed. I have logging enabled on the controller, and have enabled debug output 
when I launch slurmd on the nodes.  On the problematic node (the one with a GPU), I am seeing 
this repeating message:

slurmd: debug:  Unable to register with slurm controller, retrying

and on the controller, I am seeing this repeating message:

[2021-05-07T10:23:30.417] error: _slurm_rpc_node_registration node=devops3: 
Invalid argument

So they are definitely related.  Any help would be appreciated.  I tried moving 
the slurm.conf file from the GPU node to the controller, but that caused our 
non-GPU node to puke on startup:

slurmd: fatal: We were configured to autodetect nvml functionality, but we weren't able to 
find that lib when Slurm was configured.
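
I'm wondering if, instead of the global AutoDetect=nvml (which the non-GPU node apparently 
chokes on), we should just describe the GPU explicitly in gres.conf on the controller, scoped 
to the GPU node only. Something like this, though the device path is just a guess on my part:

NodeName=devops3 Name=gpu Type=kepler File=/dev/nvidia0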
