[slurm-users] How should I configure a node with Autodetect=nvml?

Dean Schulze Mon, 10 Feb 2020 12:14:42 -0800

In the gres.conf on one of my nodes I have just the line

    Autodetect=nvml

as in the last example in https://slurm.schedmd.com/gres.conf.html.

In the slurm.conf on all nodes I have this line for the node that uses
Autodetect=nvml:

    NodeName=slurmnode1 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=47671 Gres=gpu:gp100:4

since that node can have up to 4 GPUs dynamically assigned.  Without the
Gres=gpu:gp100:4 entry I can't run any job that requires a GPU, even if I
dynamically assign GPUs on that node.  Apparently Autodetect=nvml alone
isn't enough to let the controller know that there are GPUs available on
that node.

With this configuration I get this message every second in my slurmctld.log
file:

    error: _slurm_rpc_node_registration node=slurmnode1: Invalid argument

I've restarted both slurmd and slurmctld and still get the error.  The
node also stays in the drain state no matter what I do with it.  Apparently
Slurm doesn't like this configuration.
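
For reference, a minimal sketch of commands that might help narrow down what
the node is reporting versus what slurm.conf declares.  It assumes a recent
Slurm release (slurmd -G is not available in older versions) and uses the
node name slurmnode1 from above.  The run_if_present wrapper is a
hypothetical helper, not part of Slurm; it only skips a command whose binary
is not installed on the machine where the script runs.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): run a command only if its
# binary is on PATH, otherwise report that it was skipped.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1 not found"
    fi
}

# On the compute node: print the hardware configuration slurmd detects.
run_if_present slurmd -C

# On the compute node: print the GRES configuration slurmd derives from
# slurm.conf and gres.conf (may require the same privileges as slurmd).
run_if_present slurmd -G

# On any node with client tools: show the node record, including the
# Reason field that explains why the node is drained.
run_if_present scontrol show node slurmnode1
```

Comparing the GPU count and type printed by slurmd -G with the
Gres=gpu:gp100:4 entry in slurm.conf may show whether the two disagree.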

What is the right way to configure a node with Autodetect=nvml?
