On 3/4/19 2:26 PM, Loris Bennett wrote:
Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> writes:
We're one of the many Slurm sites which run the slurmdbd database daemon on the
same server as the slurmctld daemon.  This works without problems at our site
given our modest load; however, SchedMD recommends running the daemons on
separate servers.

Contemplating how to upgrade our cluster from Slurm 17.11 to 18.08, I've come to
appreciate the advantage of running the daemons on separate servers: One can
upgrade slurmdbd to 18.08 while keeping slurmctld at 17.11 (for a while at
least).  This enables us to upgrade to 18.08 in the recommended order without
any interruption to our running jobs and without any cluster downtime.

Can't one do this even with only one server?  We have always run both
slurmctld and slurmdbd on one machine and have performed all the updates
without any downtime.

For a minor upgrade from 17.11.x to 17.11.y there is no issue, because the MySQL database layout is unchanged.

Major upgrades such as 17.11 to 18.08 are potentially riskier; see for example the list thread "Extreme long db upgrade 16.05.6 -> 17.11.3":
https://lists.schedmd.com/pipermail/slurm-users/2018-February/000612.html

I recommend studying the instructions in https://slurm.schedmd.com/quickstart_admin.html#upgrade.

See also the slides on "Upgrading" in https://slurm.schedmd.com/SLUG18/field_notes2.pdf from the SLUG meeting 2018 (https://slurm.schedmd.com/publications.html).

Updating the database layout during a Slurm major upgrade can in special situations lead to problems, so it's safer to do the upgrade separately for slurmdbd and slurmctld. This is why I've decided to move my slurmdbd and database to a separate server now. The slurmctld which governs the entire cluster is thereby unaffected as I "play" with the database upgrade, and I can upgrade Slurm without any cluster downtime.
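
The recommended order mentioned above (slurmdbd at 18.08 while slurmctld stays at 17.11) can be expressed as a simple version comparison. The following is only a minimal sketch; the function names parse_release and order_ok are hypothetical helpers made up for illustration and are not part of Slurm:

```python
from typing import Tuple


def parse_release(version: str) -> Tuple[int, int]:
    """Return the (year, month) pair of a Slurm version string.

    Example: "18.08.5" -> (18, 8). Only the first two fields are used,
    because they identify the major release.
    """
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def order_ok(slurmdbd: str, slurmctld: str) -> bool:
    """True if slurmdbd is at the same or a newer major release than slurmctld.

    This encodes only the ordering discussed in this thread (slurmdbd is
    upgraded first); it does not check any other compatibility limits.
    """
    return parse_release(slurmdbd) >= parse_release(slurmctld)


if __name__ == "__main__":
    # slurmdbd already upgraded, slurmctld still on the old release
    print(order_ok("18.08.5", "17.11.12"))
    # the reverse order, which the recommended procedure avoids
    print(order_ok("17.11.12", "18.08.5"))
```

Such a check could be run before starting an upgrade to confirm that slurmdbd is not left behind slurmctld.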

I've tested the actual database upgrade on a test server to verify that it works without problems and to estimate the downtime expected for slurmdbd. Our database was converted in less than 5 minutes.

I've also purged a lot of old database records for job steps etc. prior to moving the database, and this reduced the size of our database by a factor of 3. See my notes in https://wiki.fysik.dtu.dk/niflheim/Slurm_database#setting-database-purge-parameters

/Ole

I've been collecting various pieces of information about Slurm upgrades and I've
come up with a tested procedure for migrating the slurmdbd service (on a
CentOS/RHEL 7 system) to a new server:

https://wiki.fysik.dtu.dk/niflheim/Slurm_database#migrate-the-slurmdbd-service-to-another-server

The basic idea is that slurmctld continues happily while slurmdbd is down, so
you can migrate the MySQL database and slurmdbd behind the scenes.  When the new
slurmdbd server is up and running, you reconfigure slurm.conf on the cluster.
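
To make the sequence concrete, here is a rough dry-run sketch that only lists the commands in order and executes nothing. It is not the procedure from the Wiki page. The host names, the dump file path and the function name migration_plan are assumptions made up for illustration, and slurm_acct_db is assumed to be the database name (Slurm's default). Installing MariaDB/MySQL, copying slurmdbd.conf and setting up database grants on the new server are deliberately left out.

```python
from typing import List, Tuple


def migration_plan(
    old_host: str,
    new_host: str,
    db_name: str = "slurm_acct_db",              # assumed: Slurm's default database name
    dump_file: str = "/root/slurm_acct_db.sql",  # hypothetical dump location
) -> List[Tuple[str, str]]:
    """Return (host, command) pairs in the order they would be run.

    Nothing is executed; this is only a checklist generator.
    """
    return [
        # slurmctld keeps running while slurmdbd is stopped
        (old_host, "systemctl stop slurmdbd"),
        # --databases makes the dump include the CREATE DATABASE statement
        (old_host, f"mysqldump --single-transaction --databases {db_name} > {dump_file}"),
        (old_host, f"scp {dump_file} {new_host}:{dump_file}"),
        (new_host, f"mysql < {dump_file}"),
        (new_host, "systemctl start slurmdbd"),
    ]


if __name__ == "__main__":
    for host, cmd in migration_plan("old-db", "new-db"):
        print(f"[{host}] {cmd}")
```

Once the new slurmdbd is running, the remaining step is the slurm.conf change described above, i.e. pointing AccountingStorageHost at the new server.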

Upgrading slurmctld and slurmd is another topic, and this is discussed in my
Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm.

I'd appreciate comments and suggestions about my procedure.
