I did already see the upgrade section of Jason's talk, but it wasn't much about 
the mechanics of the actual upgrade process; it seemed to be more of a 
big-picture view.  It dealt a lot with running different parts of slurm at 
different versions, which is something we don't do.

One little wrinkle here is that while, yes, we're using a symlink to point to 
whichever version of slurm is the current one, it's all on a shared filesystem.  
So ALL nodes, slurmdbd, and slurmctld are using that same symlink.  There is no 
way to upgrade one component at a time.  That means to upgrade, EVERYTHING 
has to come down before it can come back up.  Jason's slides seemed to 
indicate that, if there were separate symlinks, I could focus on just 
slurmdbd first and upgrade it, then focus on slurmctld and upgrade it, and then 
finally the nodes (take down their slurmd, update the link, bring up slurmd).  
So maybe that's what I'm missing.
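
For illustration, the separate-symlink idea could look roughly like the sketch 
below.  The "current-<component>" link names and the directory layout are 
hypothetical, not anything from Jason's slides or our site, and the daemon 
stop/start steps are left as comments because they are site-specific.  It only 
shows repointing one link per component in the order slurmdbd, slurmctld, 
slurmd.

```python
import os
import tempfile

# Hypothetical: one symlink per component instead of a single shared link.
UPGRADE_ORDER = ["slurmdbd", "slurmctld", "slurmd"]


def repoint(link_path, new_target):
    """Point link_path at new_target by renaming a temporary link over it."""
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(new_target, tmp)
    # rename() replaces the old symlink itself; it does not follow it.
    os.replace(tmp, link_path)


def rolling_upgrade(base, new_version):
    """Repoint each component's link in order and return the order used."""
    done = []
    for component in UPGRADE_ORDER:
        # Stop the daemon for this component here (not shown).
        link = os.path.join(base, "current-" + component)
        repoint(link, os.path.join(base, new_version))
        # Start the daemon again and check it before moving on (not shown).
        done.append(component)
    return done


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing real is touched.
    base = tempfile.mkdtemp()
    for version in ("23.02.4", "23.02.5"):
        os.mkdir(os.path.join(base, version))
    for component in UPGRADE_ORDER:
        os.symlink(os.path.join(base, "23.02.4"),
                   os.path.join(base, "current-" + component))
    print(rolling_upgrade(base, "23.02.5"))
```

With a single shared link, as we have now, the same repoint step would happen 
once and affect every component at the same moment.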

Otherwise, I think what I'm saying is that I see references to a "rolling 
upgrade", but I don't see any guide to a rolling upgrade.  I just see the 14 
steps in https://slurm.schedmd.com/quickstart_admin.html#upgrade, and I guess 
I'd always thought of that as the full-octane, high-fat upgrade.  I've only 
ever done upgrades during one of our many scheduled downtimes, because the 
upgrades were always to a new major version, and because I'm a scared little 
chicken, so I figured there was maybe some smaller subset of steps if only 
upgrading a patchlevel change.  Smaller change, less risk, fewer precautionary 
steps...?  I'm seeing now that's not the case.

Thank you all for the suggestions!

Rob


________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Ryan 
Novosielski <novos...@rutgers.edu>
Sent: Friday, September 29, 2023 2:48 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

I started off writing that there’s really no particular process for these: just 
make your changes and start the new software (be mindful of any path that might 
contain data that’s under your software tree, if you have that setup), and 
watch the timeouts. But I figured I’d have a look at the upgrade guide to be 
sure.

There’s really nothing onerous in there. I’d personally back up my database and 
state save directories, just because I’d rather be safe than sorry, and in case 
I have to go backwards. You can run slurmctld for a good while with no database 
(note that -M on the command line will be broken during that time); just be 
mindful of the RAM on the slurmctld machine and don’t restart it before the DB 
is back up. Backing up our fairly large database doesn’t take all that long. 
Whether or not step 5 is required mostly depends on how long you think steps 
6-11 will take you (which could really be seconds if your process is as simple 
as stop, change symlink, start). Step 12 you’re going to do no matter what, 
step 13 you don’t need if you skipped 5, and step 14 is up to you. So 
practically, that’s what you’re going to do anyway.

We just did an upgrade last week, and the only difference is that our compute 
nodes are stateless, so the compute node upgrades were a reboot (we could 
upgrade them running, but we did it during a maintenance period anyway, so 
why?).

If you want to do this with running jobs, I’d definitely back up the state save 
directory, but as long as you watch the timeouts, it’s pretty uneventful. You 
won’t have that long database upgrade period, since no database modifications 
will be required, so it’s pretty much like upgrading anything else.
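
As a rough sketch of the state save backup mentioned above: the snippet below 
just makes a timestamped tar.gz of a directory. The function name and the 
paths are hypothetical; the real directory is whatever StateSaveLocation is set 
to in your slurm.conf, and plain tar or rsync would do the same job.

```python
import os
import tarfile
import tempfile
import time


def backup_directory(src_dir, dest_dir):
    """Write src_dir into a timestamped .tar.gz under dest_dir; return its path."""
    src_dir = os.path.normpath(src_dir)
    name = os.path.basename(src_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = os.path.join(dest_dir, f"{name}-{stamp}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths in the archive relative to the directory name.
        tar.add(src_dir, arcname=name)
    return archive


if __name__ == "__main__":
    # Demo against a throwaway directory standing in for the state save dir.
    root = tempfile.mkdtemp()
    fake_state = os.path.join(root, "statesave")
    os.mkdir(fake_state)
    with open(os.path.join(fake_state, "job_state"), "w") as handle:
        handle.write("placeholder\n")
    print(backup_directory(fake_state, root))
```

Taking the copy while slurmctld is stopped avoids archiving files that are 
being rewritten mid-backup.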

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'

On Sep 28, 2023, at 11:58, Groner, Rob <rug...@psu.edu> wrote:


There are 14 steps to upgrading slurm listed on their website, including 
shutting down and backing up the database.  So far we've only updated slurm 
during a downtime, and it's been a major version change, so we've taken all the 
steps indicated.

We now want to upgrade from 23.02.4 to 23.02.5.
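
To make "patchlevel change" concrete, here is a small sketch that checks 
whether two version strings differ only in the last field.  It assumes the 
first two fields (23.02) identify the release series and the third is a plain 
integer; the function names are made up for illustration.

```python
def release_series(version):
    """Return the first two dot-separated fields, e.g. '23.02' for '23.02.5'."""
    parts = version.split(".")
    if len(parts) < 3:
        raise ValueError(f"expected a version like 23.02.5, got {version!r}")
    return ".".join(parts[:2])


def is_patch_level_upgrade(old, new):
    """True if both versions share a release series and the last field goes up."""
    same_series = release_series(old) == release_series(new)
    old_patch = int(old.split(".")[2])
    new_patch = int(new.split(".")[2])
    return same_series and new_patch > old_patch


if __name__ == "__main__":
    print(is_patch_level_upgrade("23.02.4", "23.02.5"))
    print(is_patch_level_upgrade("22.05.9", "23.02.5"))
```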

Our slurm builds end up in version named directories, and we tell production 
which one to use via symlink.  Changing the symlink will automatically change 
it on our slurm controller node and all slurmd nodes.

Is there an expedited, simple, slimmed down upgrade path to follow if we're 
looking at just a point-level (patchlevel) upgrade?

Rob
