You can push a new conf file and issue an "scontrol reconfigure" on the fly as needed... I do it on our cluster: push to the compute nodes first, then the login nodes, then the slurm controller... you are making a huge issue of a very basic task...
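A minimal sketch of that push-then-reconfigure flow. The hostnames, path, and scp transport are placeholders for whatever your site actually uses (pdsh/clush, config management, etc.); with DRY_RUN=1 (the default here) it only prints the commands, so you can sanity-check the ordering - compute nodes, then logins, then the controller:

```shell
#!/bin/sh
# Sketch only: hostnames and paths are illustrative, not a real site.
# DRY_RUN=1 prints the commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
CONF=/etc/slurm/slurm.conf

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Ship the new file out: compute nodes first, then login nodes,
# then the controller host.
for host in node01 node02 login1 slurmctl1; do
    run scp "$CONF" "$host:$CONF"
done

# One reconfigure makes slurmctld and every slurmd re-read slurm.conf;
# no daemon restart is needed for most config changes.
run scontrol reconfigure
```

Changes that add or remove nodes historically required an actual daemon restart rather than a reconfigure, which is worth checking against the FAQ for your Slurm version.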
Sid

On Tue, 4 May 2021, 22:28 Tina Friedrich, <tina.friedr...@it.ox.ac.uk> wrote:
> Hello,
>
> a lot of people already gave very good answers on how to tackle this.
>
> Still, I thought it worth pointing this out - you said 'you need to
> basically shut down slurm, update the slurm.conf file, then restart'.
> That makes it sound like a major operation with lots of prep required.
>
> It's not like that at all. Updating slurm.conf is not a major operation.
>
> There's absolutely no reason to shut things down first & then change the
> file. You can edit the file / ship out a new version (however you like)
> and then restart the daemons.
>
> The daemons do not have to all be restarted simultaneously. It is of no
> consequence if they're running with out-of-sync config files for a bit,
> really. (There's a flag you can set if you want to suppress the warning
> - the 'NO_CONF_HASH' debug flag, I think.)
>
> Restarting the daemons (slurmctld, slurmd, ...) is safe. It does not
> require cluster downtime or anything.
>
> I control slurm.conf using configuration management; the config
> management process restarts the appropriate daemon (slurmctld, slurmd,
> slurmdbd) if the file changed. This certainly never happens at the same
> time; there's splay in that. It doesn't even necessarily happen on the
> controller first, or anything like that.
>
> What I'm trying to get across - I have a feeling this 'updating the
> cluster-wide config file' and 'file must be the same on all nodes' is a
> lot less of a procedure (and a lot less strict) than you currently
> imagine it to be :)
>
> Tina
>
> On 27/04/2021 19:35, David Henkemeyer wrote:
> > Hello,
> >
> > I'm new to Slurm (coming from PBS), and so I will likely have a few
> > questions over the next several weeks, as I work to transition my
> > infrastructure from PBS to Slurm.
> >
> > My first question has to do with *_adding nodes to Slurm_*.
> > According to the FAQ (and other articles I've read), you need to
> > basically shut down slurm, update the slurm.conf file /*on all nodes
> > in the cluster*/, then restart slurm.
> >
> > - Why do all nodes need to know about all other nodes? From what I
> > have read, it's that Slurm does a checksum comparison of the
> > slurm.conf file across all nodes. Is this the only reason all nodes
> > need to know about all other nodes?
> > - Can I create a symlink that points <sysconfdir>/slurm.conf to a
> > slurm.conf file on an NFS mount point, which is mounted on all the
> > nodes? This way, I would only need to update a single file, then
> > restart Slurm across the entire cluster.
> > - Any additional help/resources for adding/removing nodes to Slurm
> > would be much appreciated. Perhaps there is a "toolkit" out there to
> > automate some of these operations (which is what I already have for
> > PBS, and will create for Slurm, if something doesn't already exist).
> >
> > Thank you all,
> >
> > David
>
> --
> Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator
>
> Research Computing and Support Services
> IT Services, University of Oxford
> http://www.arc.ox.ac.uk http://www.it.ox.ac.uk
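The symlink idea from David's question can be sketched as below. All paths here are stand-ins (a temp directory playing the role of the NFS mount and of <sysconfdir>), and the "same file everywhere" check Slurm performs is conceptually just a hash comparison - which a shared symlink target trivially satisfies:

```shell
#!/bin/sh
# Sketch with illustrative paths: keep one authoritative slurm.conf on a
# shared mount and point each node's expected location at it.
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/nfs" "$tmp/etc/slurm"   # stand-ins for the NFS mount and <sysconfdir>
printf 'ClusterName=demo\n' > "$tmp/nfs/slurm.conf"

# On every node: symlink <sysconfdir>/slurm.conf to the shared copy.
ln -s "$tmp/nfs/slurm.conf" "$tmp/etc/slurm/slurm.conf"

# Slurm's cross-node consistency check is essentially a config-hash
# comparison; nodes reading the same shared file always agree.
sha256sum "$tmp/etc/slurm/slurm.conf" "$tmp/nfs/slurm.conf"
```

The trade-off with the NFS approach is that the shared filesystem becomes a single point of failure for the whole cluster's config, which is one reason replies in this thread favor shipping copies via configuration management instead.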