I agree that updating slurm.conf is being made out to be a bigger issue
than it really is. However, there are certain config changes that do
require restarting the daemons rather than just running
'scontrol reconfigure'. These options are documented in the slurm.conf
documentation (just search for "restart").
I believe it's often only slurmctld that needs to be restarted, which
is one daemon on one system, rather than restarting slurmd on all the
compute nodes, but there are a few options that require all Slurm
daemons to be restarted. Adding nodes to a cluster is one of them:
Changes in node configuration (e.g. adding nodes, changing their
processor count, etc.) require restarting both the slurmctld daemon
and the slurmd daemons. All slurmd daemons must know each node in the
system to forward messages in support of hierarchical communications.
But to avoid this, you can use the FUTURE state to define "future" nodes:
*FUTURE*
Indicates the node is defined for future use and need not exist
when the Slurm daemons are started. These nodes can be made
available for use simply by updating the node state using the
scontrol command rather than restarting the slurmctld daemon.
After these nodes are made available, change their State in the
slurm.conf file. Until these nodes are made available, they will
not be seen using any Slurm commands, nor will any attempt be
made to contact them.
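As a sketch of how that looks in practice (the node names, counts, and
hardware figures below are made up for illustration), you pre-define the
spare nodes in slurm.conf and later bring them online with scontrol:

```
# slurm.conf: existing nodes plus spares reserved for future expansion
NodeName=node[01-04] CPUs=16 RealMemory=64000 State=UNKNOWN
NodeName=node[05-08] CPUs=16 RealMemory=64000 State=FUTURE
PartitionName=batch Nodes=node[01-08] Default=YES State=UP

# Later, once node05 physically exists and its slurmd is running,
# bring it into service without restarting slurmctld:
#   scontrol update NodeName=node05 State=RESUME
# ...and then update its State in slurm.conf to match.
```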
--
Prentice
On 5/4/21 8:32 AM, Sid Young wrote:
You can push out a new conf file and issue an "scontrol reconfigure" on
the fly as needed... I do it on our cluster whenever required: the nodes
first, then the login nodes, then the Slurm controller... you are making
a huge issue of a very basic task...
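A minimal sketch of that push-then-reconfigure flow (the hostnames and
paths here are assumptions; use whatever distribution mechanism you
already have):

```
# Copy the new config out, compute nodes first, controller last
for host in node01 node02 login01 ctld01; do
    scp /srv/slurm/slurm.conf "$host":/etc/slurm/slurm.conf
done

# Ask the running daemons to re-read their config -- no restart
# needed for most options:
scontrol reconfigure
```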
Sid
On Tue, 4 May 2021, 22:28 Tina Friedrich, <tina.friedr...@it.ox.ac.uk> wrote:
Hello,
a lot of people already gave very good answer to how to tackle this.
Still, I thought it worth pointing this out - you said 'you need to
basically shut down slurm, update the slurm.conf file, then restart'.
That makes it sound like a major operation with lots of prep required.
It's not like that at all. Updating slurm.conf is not a major operation.
There's absolutely no reason to shut things down first & then change the
file. You can edit the file / ship out a new version (however you like)
and then restart the daemons.
The daemons do not have to all be restarted simultaneously. It is of no
consequence if they're running with out-of-sync config files for a bit,
really. (There's a flag you can set if you want to suppress the warning
- the NO_CONF_HASH debug flag, I think.)
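For reference, that flag can be set in slurm.conf or toggled at runtime;
a sketch:

```
# slurm.conf: suppress the "config file hash mismatch" warning
DebugFlags=NO_CONF_HASH

# Or toggle it at runtime without editing the file:
#   scontrol setdebugflags +no_conf_hash
```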
Restarting the daemons (slurmctld, slurmd, ...) is safe. It does not
require cluster downtime or anything.
I control slurm.conf using configuration management; the config
management process restarts the appropriate daemon (slurmctld, slurmd,
slurmdbd) if the file changed. This certainly never happens at the same
time; there's splay in that. It doesn't even necessarily happen on the
controller first, or anything like that.
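A hand-rolled sketch of what such a config-management handler amounts to
(paths and the random splay window are assumptions; in practice this is
an Ansible/Puppet/Chef handler rather than a script like this):

```
#!/bin/bash
# Install the new slurm.conf only if it actually differs, then restart
# the daemon that runs on this host (slurmd on compute nodes; slurmctld
# or slurmdbd on the controller/database hosts).
new=/srv/config/slurm.conf
cur=/etc/slurm/slurm.conf

if ! cmp -s "$new" "$cur"; then
    # Random sleep provides the "splay" so nodes don't all restart at once
    sleep $((RANDOM % 60))
    install -m 644 "$new" "$cur"
    systemctl restart slurmd
fi
```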
What I'm trying to get across - I have a feeling this 'updating the
cluster wide config file' and 'file must be the same on all nodes' is a
lot less of a procedure (and a lot less strict) than you currently
imagine it to be :)
Tina
On 27/04/2021 19:35, David Henkemeyer wrote:
> Hello,
>
> I'm new to Slurm (coming from PBS), and so I will likely have a few
> questions over the next several weeks, as I work to transition my
> infrastructure from PBS to Slurm.
>
> My first question has to do with *_adding nodes to Slurm_*. According
> to the FAQ (and other articles I've read), you need to basically shut
> down slurm, update the slurm.conf file /*on all nodes in the cluster*/,
> then restart slurm.
>
> - Why do all nodes need to know about all other nodes? From what I
> have read, it's because Slurm does a checksum comparison of the
> slurm.conf file across all nodes. Is this the only reason all nodes
> need to know about all other nodes?
> - Can I create a symlink that points <sysconfdir>/slurm.conf to a
> slurm.conf file on an NFS mount point, which is mounted on all the
> nodes? This way, I would only need to update a single file, then
> restart Slurm across the entire cluster.
> - Any additional help/resources for adding/removing nodes to Slurm
> would be much appreciated. Perhaps there is a "toolkit" out there to
> automate some of these operations (which is what I already have for
> PBS, and will create for Slurm, if something doesn't already exist).
>
> Thank you all,
>
> David
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems
Administrator
Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk
http://www.it.ox.ac.uk