Hi Rob,

Slurm doesn’t have a “validate” parameter, hence one must know ahead of time 
whether the configuration will work or not.

In answer to your question – yes – on our site the Slurm configuration is 
altered outside of a maintenance window.

Depending upon the potential impact of the change, it will either be made 
silently (no announcement) or users are notified on Slack that there may be a 
brief outage.

Slurm is quite resilient – if slurmctld is down, no new jobs will be launched 
and user commands will fail, but all existing jobs will keep running.
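
After a change, the main thing to confirm is that slurmctld is answering again 
before telling anyone all is well. The little Python sketch below is 
illustrative only (it is not part of Slurm and not our tooling); it just shells 
out to scontrol ping, which is assumed to be installed wherever it runs:

#!/usr/bin/env python3
"""Poll the Slurm controller until it responds again after a config change.

Illustrative sketch only: it assumes `scontrol` is on the PATH of the host
running it and that the client commands point at the controller in question.
"""
import subprocess
import sys
import time

def controller_is_up() -> bool:
    # `scontrol ping` reports whether the controller(s) respond; a non-zero
    # exit status is treated here as "not up".
    result = subprocess.run(["scontrol", "ping"], capture_output=True, text=True)
    return result.returncode == 0

def wait_for_controller(timeout_s: int = 120, interval_s: int = 5) -> bool:
    # Keep polling until the controller answers or the timeout expires.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if controller_is_up():
            return True
        time.sleep(interval_s)
    return False

if __name__ == "__main__":
    if wait_for_controller():
        print("slurmctld is responding; user commands should work again.")
        sys.exit(0)
    print("slurmctld still not responding; check slurmctld.log.", file=sys.stderr)
    sys.exit(1)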

Our users are quite tolerant as well – letting them know when a change may 
impact their overall experience of the cluster seems to be appreciated.

On our site the configuration files are not changed directly; rather, a 
template engine is used – our Slurm configuration data is in YAML files, which 
are then validated and processed to generate the slurm.conf / nodes.conf / 
partitions.conf / topology.conf

This provides some surety that adding / removing nodes etc. won’t result in an 
inadvertent configuration issue.
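
To make the templating idea concrete, here is a minimal sketch of that kind of 
pipeline in Python. It is not our actual tooling; the YAML layout, the 
nodes.yaml file name and the template are illustrative assumptions, and it only 
shows the nodes.conf piece. It assumes PyYAML and Jinja2 are installed:

#!/usr/bin/env python3
"""Illustrative YAML -> Slurm config templating step (nodes.conf only).

Not the tooling described above, just the general idea: load structured node
data, run a couple of sanity checks, then render a config fragment.
"""
import sys
import yaml
from jinja2 import Environment, StrictUndefined

NODES_TEMPLATE = """\
# Generated file -- do not edit by hand.
{% for group in node_groups %}
NodeName={{ group.name_pattern }} CPUs={{ group.cpus }} RealMemory={{ group.real_memory }} State=UNKNOWN
{% endfor %}
"""

def validate(node_groups):
    """Fail fast on mistakes that would otherwise trip up slurmctld later."""
    errors = []
    for group in node_groups:
        for key in ("name_pattern", "cpus", "real_memory"):
            if key not in group:
                errors.append(f"{group.get('name_pattern', '<unnamed>')}: missing '{key}'")
        if not isinstance(group.get("cpus"), int) or group.get("cpus", 0) <= 0:
            errors.append(f"{group.get('name_pattern', '<unnamed>')}: 'cpus' must be a positive integer")
    return errors

def main():
    with open("nodes.yaml") as fh:          # hypothetical input file
        data = yaml.safe_load(fh)

    errors = validate(data.get("node_groups", []))
    if errors:
        for err in errors:
            print(f"ERROR: {err}", file=sys.stderr)
        sys.exit(1)                          # refuse to generate a broken config

    env = Environment(undefined=StrictUndefined)  # undefined variables are hard errors
    rendered = env.from_string(NODES_TEMPLATE).render(node_groups=data["node_groups"])

    with open("nodes.conf", "w") as fh:
        print(rendered, file=fh)

if __name__ == "__main__":
    main()

With StrictUndefined, a typo in the template or a missing key in the YAML 
aborts generation instead of silently producing an empty value in the 
generated file.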

We have three clusters (one production, and two test) – all are managed the 
same way.

Finally, using configuration templating it’s possible to spin up new clusters 
quite quickly . . . The longest time is spent picking a new cluster name.

   -Greg

On 17/01/2023, 23:42, "slurm-users" <slurm-users-boun...@lists.schedmd.com> 
wrote:

So, you have two equal sized clusters, one for test and one for production?  
Our test cluster is a small handful of machines compared to our production.

We have a test slurm control node on a test cluster with a test slurmdbd host 
and test nodes, all named specifically for test.  We don't want a situation 
where our "test" slurm controller node is named the same as our "prod" slurm 
controller node, because the possibility of mistake is too great.  ("I THOUGHT 
I was on the test network....")

Here's the ultimate question I'm trying to get answered....  Does anyone update 
their slurm.conf file on production outside of an outage?  If so, how do you 
KNOW the slurmctld won't barf on some problem in the file you didn't see (even 
a mistaken character in there would do it)?  We're trying to move to a model 
where we don't have downtimes as often, so I need to determine a reliable way 
to continue to add features to slurm without having to wait for the next 
outage.  There's no way I know of to prove the slurm.conf file is good, except 
by feeding it to slurmctld and crossing my fingers.

Rob
