Hi community,
I am unable to tell whether Slurm is handling the following situation
efficiently in terms of CPU affinities for each partition.
We have a very small cluster with just one GPU node with 8x GPUs, which
serves two partitions: "gpu" and "cpu".
Part of the config file:
## Nodes list
## u
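Since the config excerpt is cut off, here is a minimal sketch of what a one-node, two-partition layout like the one described might look like in slurm.conf (node name, CPU count, and partition options are assumptions, not taken from the thread):

```
# Hypothetical node and partition definitions (names/counts are placeholders)
NodeName=gpunode01 CPUs=64 Gres=gpu:8 State=UNKNOWN
PartitionName=gpu Nodes=gpunode01 Default=NO  MaxTime=INFINITE State=UP
PartitionName=cpu Nodes=gpunode01 Default=YES MaxTime=INFINITE State=UP
```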
The GPU nodes shouldn't have any config files; in configless mode they get
them from the controller (i.e. all config files are centralized).
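As a sketch of what "configless" means in practice: the controller opts in via a slurm.conf parameter, and each worker's slurmd is pointed at the controller instead of reading local files (the hostname below is a placeholder):

```
# On the controller, in slurm.conf:
SlurmctldParameters=enable_configless

# On each worker, point slurmd at the controller (hostname is a placeholder);
# alternatively a _slurmctld._tcp DNS SRV record can supply this:
slurmd --conf-server ctl-host:6817
```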
Now, did you build Slurm on the GPU nodes, or install it via a package
manager? If via a package manager, do you know whether it was
compiled/packaged on a node with the NVIDIA libraries?
(I
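One way to check whether an installed slurmd build can actually use NVML is to look for the NVML GPU plugin in Slurm's plugin directory (the path below is an assumption; it varies by distro and build):

```
# Plugin directory is an assumption; confirm with:
#   scontrol show config | grep -i plugindir
ls /usr/lib64/slurm/ | grep -i gpu
# If gpu_nvml.so is absent, this build cannot autodetect GPUs via NVML.
```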
Thank you for the reply, Will!
The slurm.conf file only has one line in it:
AutoDetect=nvml
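For reference, AutoDetect=nvml is a gres.conf directive rather than a slurm.conf one, and it is usually paired with a GRES declaration in slurm.conf; a minimal sketch (the node name is a placeholder, the GPU count comes from the thread's description):

```
# gres.conf (kept on the controller only, in configless mode):
AutoDetect=nvml

# slurm.conf must also declare the GRES type and the node's GPUs:
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:8 State=UNKNOWN
```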
During my debugging, I copied this file from the GPU node to the controller,
but that's when I noticed that the node without a GPU then crashed on startup.
David
On Fri, May 7, 2021 at 12:14 PM Will Dennis wrote:
Hi David,
What is in the gres.conf in the controller's /etc/slurm? Is it autodetect
via nvml?
In configless mode, slurm.conf, gres.conf, etc. are maintained only on the
controller, and the worker nodes get them from there automatically (you don't
want those files on the worker nodes). If you need to s
Hello all. My team is enabling Slurm (version 20.11.5) in our environment,
and we have a controller up and running along with two nodes. Everything was
working fine; however, when we tried to enable configless mode, we ran into a
problem. The node that has a GPU is coming up in "drained" state, and
s
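When a node comes up drained, the reason string recorded by slurmctld usually points at the cause (commonly a GRES count that differs from the config); a sketch of the usual triage, with a placeholder node name:

```
# Show the drain reason recorded by slurmctld
sinfo -R
scontrol show node gpunode01 | grep -i reason

# After fixing the config, return the node to service
scontrol update NodeName=gpunode01 State=RESUME
```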