This sounds like a great idea.  My org has been strangely resistant to setting
up HA for Slurm; this might be a good enough reason.  Thanks.

Rob
________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Brian 
Andrus <toomuc...@gmail.com>
Sent: Tuesday, January 17, 2023 5:54 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Maintaining slurm config files for test and 
production clusters



Run a secondary controller.

Do 'scontrol takeover' before any changes, then make your changes and restart
slurmctld on the primary.

If it fails, no harm, no foul: the secondary is still running happily.
If it succeeds, the primary takes control back and you can then restart the
secondary with the new (known-good) config.
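
For example, a rough sketch of that sequence (assuming a SlurmctldHost
primary/backup pair and systemd-managed daemons; hostnames and paths here are
illustrative):

    # On the backup controller: take over before touching anything.
    scontrol takeover

    # On the primary: install the candidate config, then restart.
    cp slurm.conf.candidate /etc/slurm/slurm.conf
    systemctl restart slurmctld

    # Check which controller is answering. If the primary started
    # cleanly it reclaims control; if not, the backup keeps serving.
    scontrol ping

    # Once the primary reports UP, copy the known-good config to the
    # backup and restart its slurmctld too.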


Brian Andrus


On 1/17/2023 12:36 PM, Groner, Rob wrote:
So, you have two equal-sized clusters, one for test and one for production?
Our test cluster is a small handful of machines compared to our production
cluster.

We have a test Slurm control node on a test cluster, with a test slurmdbd host
and test nodes, all named specifically for test.  We don't want a situation
where our "test" Slurm controller node is named the same as our "prod" Slurm
controller node, because the possibility of a mistake is too great.  ("I THOUGHT
I was on the test network....")

Here's the ultimate question I'm trying to get answered: does anyone update
their slurm.conf file on production outside of an outage?  If so, how do you
KNOW that slurmctld won't barf on some problem in the file you didn't see (even
one mistaken character in there would do it)?  We're trying to move to a model
where we don't have downtimes as often, so I need a reliable way to keep adding
features to Slurm without waiting for the next outage.  There's no way I know
of to prove the slurm.conf file is good, except by feeding it to slurmctld and
crossing my fingers.
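
The closest thing to a dry run I can imagine (and it only shows the file
parses and the daemon starts, not that the config is semantically right) would
be something like this on an isolated test host, NOT the live controller:

    # Point a throwaway slurmctld at the candidate file:
    slurmctld -D -f /tmp/slurm.conf.candidate
    # -D  run in the foreground, logging to the terminal
    # -f  read the named config file instead of the default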

Rob

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
Fulcomer, Samuel <samuel_fulco...@brown.edu>
Sent: Wednesday, January 4, 2023 1:54 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Maintaining slurm config files for test and 
production clusters



Just make the cluster names the same, with different NodeName and Partition
lines.  The rest of slurm.conf can be the same.  Having two cluster names is
only necessary if you're running production in a multi-cluster configuration.
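
For instance (all names illustrative), both clusters could share:

    ClusterName=hpc
    Include nodes.conf        # per-cluster file, different on test and prod

with each cluster's nodes.conf carrying its own lines, e.g. on test:

    NodeName=tn[01-04] CPUs=32 State=UNKNOWN
    PartitionName=debug Nodes=tn[01-04] Default=YES State=UP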

Our model has been to have a production cluster and a test cluster which 
becomes the production cluster at yearly upgrade time (for us, next week). The 
test cluster is also used for rebuilding MPI prior to the upgrade, when the PMI 
changes. We force users to resubmit jobs at upgrade time (after the maintenance 
reservation) to ensure that MPI runs correctly.



On Wed, Jan 4, 2023 at 12:26 PM Groner, Rob <rug...@psu.edu> wrote:
We currently have a test cluster and a production cluster, both on the same
network.  We try things on the test cluster, then gather those changes and
apply them to the production cluster.  We're doing that through two different
repos, but we'd like to have a single repo to make the transition from testing
configs to publishing them more seamless.  The problem is, of course, that the
test and production clusters have different cluster names, as well as
different nodes within them.

Using the Include directive, I can pull all of the NodeName lines out of
slurm.conf and put them into %c-nodes.conf files, one for production and one
for test (see the sketch after the list below).  That still leaves me with two
problems:

  *   The cluster name itself will still be a problem.  I WANT the same
slurm.conf file between test and production, but the ClusterName line will
differ between the two.  Can I use an env var in that line, so that production
and test can each supply a different value?
  *   The gres.conf file.  I tried using the same Include trick that works in
slurm.conf, but it failed because it did not know what the ClusterName was.
I think that means either that the %c substitution doesn't work for anything
other than slurm.conf, or that the cluster name will have to be defined in
gres.conf as well?
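
For concreteness, the layout I'm describing (cluster and node names are
illustrative):

    # shared slurm.conf, identical except for this one line:
    ClusterName=test          # "prod" on the production controller
    Include %c-nodes.conf     # expands to test-nodes.conf / prod-nodes.conf

    # test-nodes.conf
    NodeName=tn[01-04] CPUs=32 RealMemory=128000 State=UNKNOWN
    PartitionName=debug Nodes=tn[01-04] Default=YES State=UP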

Any other suggestions for how to keep our Slurm files in a single source
control repo, but still have the flexibility to run them elegantly on either
test or production systems?

Thanks.
