I agree that updating slurm.conf is being made out to be a bigger issue
than it really is. However, there are certain config changes that do
require restarting the daemons rather than just running
'scontrol reconfigure'. These options are documented in the slurm.conf
documentation (just search for "restart").
I believe it's often only slurmctld that needs to be restarted, which
is one daemon on one system, rather than restarting slurmd on all the
compute nodes, but there are a few options that require all Slurm
daemons to be restarted. Adding nodes to a cluster is one of them:
Changes in node configuration (e.g. adding nodes, changing their
processor count, etc.) require restarting both the slurmctld daemon
and the slurmd daemons. All slurmd daemons must know each node in the
system to forward messages in support of hierarchical communications.
But to avoid this, you can use the FUTURE state to define "future" nodes:
*FUTURE*
Indicates the node is defined for future use and need not exist
when the Slurm daemons are started. These nodes can be made
available for use simply by updating the node state using the
scontrol command rather than restarting the slurmctld daemon.
After these nodes are made available, change their State in the
slurm.conf file. Until these nodes are made available, they will
not be seen using any Slurm commands, nor will any attempt be
made to contact them.
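As a sketch of how that looks in practice (the node names, counts, and
hardware figures below are made up for illustration), you pre-define the
spare nodes in slurm.conf and later bring them online with scontrol:

```
# slurm.conf: existing nodes plus spares reserved for future expansion
NodeName=node[01-04] CPUs=16 RealMemory=64000 State=UNKNOWN
NodeName=node[05-08] CPUs=16 RealMemory=64000 State=FUTURE
PartitionName=batch Nodes=node[01-08] Default=YES State=UP

# Later, once node05 physically exists and its slurmd is running,
# bring it into service without restarting slurmctld:
#   scontrol update NodeName=node05 State=RESUME
# ...and then update its State in slurm.conf to match.
```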
--
Prentice
On 5/4/21 8:32 AM, Sid Young wrote:
You can push out a new conf file and issue an "scontrol reconfigure" on
the fly as needed... I do it on our cluster whenever required: the nodes
first, then the login nodes, then the Slurm controller... you are making
a huge issue of a very basic task...
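A minimal sketch of that push-then-reconfigure flow (the hostnames and
paths here are assumptions; use whatever distribution mechanism you
already have):

```
# Copy the new config out, compute nodes first, controller last
for host in node01 node02 login01 ctld01; do
    scp /srv/slurm/slurm.conf "$host":/etc/slurm/slurm.conf
done

# Ask the running daemons to re-read their config -- no restart
# needed for most options:
scontrol reconfigure
```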
Sid
On Tue, 4 May 2021, 22:28 Tina Friedrich, <tina.friedr...@it.ox.ac.uk> wrote:
Hello,
a lot of people already gave very good answer to how to tackle this.
Still, I thought it worth pointing this out - you said 'you need to
basically shut down slurm, update the slurm.conf file, then restart'.
That makes it sound like a major operation with lots of prep required.
It's not like that at all. Updating slurm.conf is not a major operation.
There's absolutely no reason to shut things down first & then change the
file. You can edit the file / ship out a new version (however you like)
and then restart the daemons.
The daemons do not have to all be restarted simultaneously. It is of no
consequence if they're running with out-of-sync config files for a bit,
really. (There's a flag you can set if you want to suppress the warning
- the NO_CONF_HASH debug flag, I think.)
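For reference, that flag can be set in slurm.conf or toggled at runtime;
a sketch:

```
# slurm.conf: suppress the "config file hash mismatch" warning
DebugFlags=NO_CONF_HASH

# Or toggle it at runtime without editing the file:
#   scontrol setdebugflags +no_conf_hash
```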
Restarting the daemons (slurmctld, slurmd, ...) is safe. It does not
require cluster downtime or anything.
I control slurm.conf using configuration management; the config
management process restarts the appropriate daemon (slurmctld, slurmd,
slurmdbd) if the file changed. This certainly never happens at the same
time; there's splay in that. It doesn't even necessarily happen on the
controller first, or anything like that.
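A hand-rolled sketch of what such a config-management handler amounts to
(paths and the random splay window are assumptions; in practice this is
an Ansible/Puppet/Chef handler rather than a script like this):

```
#!/bin/bash
# Install the new slurm.conf only if it actually differs, then restart
# the daemon that runs on this host (slurmd on compute nodes; slurmctld
# or slurmdbd on the controller/database hosts).
new=/srv/config/slurm.conf
cur=/etc/slurm/slurm.conf

if ! cmp -s "$new" "$cur"; then
    # Random sleep provides the "splay" so nodes don't all restart at once
    sleep $((RANDOM % 60))
    install -m 644 "$new" "$cur"
    systemctl restart slurmd
fi
```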
What I'm trying to get across - I have a feeling this 'updating the
cluster wide config file' and 'file must be the same on all nodes' is a
lot less of a procedure (and a lot less strict) than you currently
imagine it to be :)
Tina
On 27/04/2021 19:35, David Henkemeyer wrote:
> Hello,
>
> I'm new to Slurm (coming from PBS), and so I will likely have a few
> questions over the next several weeks, as I work to transition my
> infrastructure from PBS to Slurm.
>
> My first question has to do with *_adding nodes to Slurm_*. According
> to the FAQ (and other articles I've read), you need to basically shut
> down slurm, update the slurm.conf file /*on all nodes in the cluster*/,
> then restart slurm.
>
> - Why do all nodes need to know about all other nodes? From what I
> have read, it's because Slurm does a checksum comparison of the
> slurm.conf file across all nodes. Is this the only reason all nodes
> need to know about all other nodes?
> - Can I create a symlink that points <sysconfdir>/slurm.conf to a
> slurm.conf file on an NFS mount point, which is mounted on all the
> nodes? This way, I would only need to update a single file, then
> restart Slurm across the entire cluster.
> - Any additional help/resources for adding/removing nodes to Slurm
> would be much appreciated. Perhaps there is a "toolkit" out there to
> automate some of these operations (which is what I already have for
> PBS, and will create for Slurm, if something doesn't already exist).
>
> Thank you all,
>
> David
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems
Administrator
Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk
http://www.it.ox.ac.uk