Hi Tony,

On Mon, Oct 21, 2019 at 01:52:21AM +0000, Tony Racho wrote:
> Hi:
>
> We are planning to upgrade our slurm cluster; however, we plan on NOT
> doing it in one go.
>
> We are on 18.08.7 at the moment (db, controller, clients).
>
> We'd like to do it in a phased approach.
>
> Stop communication between the controller and slurmdbd while updating
> slurmdbd to 19.05.X.
>
> Concurrently, we will update our primary controller to 19.05.X while the
> back-up controller takes over the primary's chores. (The back-up
> controller will then also be upgraded to 19.05.X.)
>
> Once the primary controller has been updated to 19.05.X, it obviously
> resumes control of the cluster, but the clients will still be on 18.08.7.
> Will there be any issues with this set-up? If this works, we will choose
> a subset of clients and upgrade them to 19.05.X while the others stay on
> 18.08.7 until all the clients have been upgraded to 19.05.X.
>
> My question is: will the process/set-up above work? Will the clients
> still be able to communicate with the controller without any unintended
> effects or issues? Has anyone done this process?
>
> Once all the controllers and clients are upgraded to 19.05.X, resume
> communication between the controllers and the slurmdbd.
We do not have backup controllers at the moment, so I cannot comment on the
exact situation you are in. However, we upgraded from 17.11 to 19.05 as
follows.

- Update slurmdbd (we first took a dump of the VM (after shutting down
  slurmdbd and mariadb) and started a new VM from this dump to verify the
  update would work without any problems -- this turned out to be the case,
  so we could proceed on the production machine)
- For each cluster:
  - Increase the timeouts for slurmctld and slurmdbd to an hour
  - Set the partitions to DOWN so no new jobs would start
  - Back up the slurm spool dir (state save location), just in case
  - Bring down slurmctld and update the masters
  - Update slurmd on all the worker nodes; they picked up the running jobs
    and started talking to the updated masters
  - Lower the timeouts again to their original settings

Now, if you have a backup controller as well, I suppose you can bring it
down at the same time as you update the primary controller?

We were told that in principle we could also just update each worker node
as it became idle, and so have two versions of slurmd running at the same
time -- and this is more or less what happened, only that period was rather
short, as the nodes were updated at roughly the same time (with some random
delays to avoid overloading the repo server).

> While doing the upgrade the following scenario will take place.
>
> slurmdbd - 19.05.X (but not communicating with the controllers)

There is no need to keep it from talking to the controllers, I think. When
the update of slurmdbd is complete, you can have the controllers talk to
it, since it knows how to handle incoming data from up to two versions
back. When slurmdbd is down, nothing can talk to it :) When it is back up,
everything can proceed as normal.

Hope this helps a bit,
-- Andy
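For concreteness, the per-cluster steps I listed might look roughly like
the sketch below. This is not verbatim what we ran; the partition name
"compute", the backup path, and the state save location are placeholders
(check PartitionName and StateSaveLocation in your own slurm.conf), and the
timeout values go in slurm.conf rather than on the command line:

```shell
# In slurm.conf, raise the timeouts before starting (then reconfigure):
#   SlurmctldTimeout=3600
#   SlurmdTimeout=3600

# Stop new jobs from starting (repeat for every partition):
scontrol update PartitionName=compute State=DOWN

# Back up the state save location, just in case
# (/var/spool/slurmctld is a placeholder for your StateSaveLocation):
tar czf /root/slurm-state-backup.tar.gz /var/spool/slurmctld

# Bring down the controller, install the new packages, start it again:
systemctl stop slurmctld
# ... upgrade the slurm packages on the master here ...
systemctl start slurmctld

# After slurmd has been upgraded on the worker nodes, reopen the
# partitions and restore the original timeouts in slurm.conf:
scontrol update PartitionName=compute State=UP
```

Running jobs survive this as long as slurmd comes back within SlurmdTimeout,
which is why the timeouts are raised first and lowered again at the end.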