On 8/6/20 10:13 am, Jason Simms wrote:

> Later this month, I will have to bring down, patch, and reboot all nodes in our cluster for maintenance. The two options available to set nodes into a maintenance mode seem to be either: 1) creating a system-wide reservation, or 2) setting all nodes into a DRAIN state.

We use both. :-)

So for cases where we need a system-wide outage, we put maintenance reservations in place ahead of time to ensure the system is drained by the start of the maintenance window.
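A minimal sketch of that workflow (the reservation name, times, and duration here are illustrative, not our actual values):

```shell
# Create a system-wide maintenance reservation covering all nodes.
# The "maint" flag marks it as a maintenance reservation; "ignore_jobs"
# allows it to be created even though jobs are running on those nodes.
scontrol create reservation reservationname=maint_aug2020 \
    starttime=2020-08-25T08:00:00 duration=08:00:00 \
    user=root flags=maint,ignore_jobs nodes=ALL
```

With that in place, the backfill scheduler won't start any job that could not finish before the reservation begins, so the system drains itself naturally.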

But for rolling upgrades we will build a new image, set the nodes to use it, and then do something like:

scontrol reboot ASAP nextstate=resume reason="Rolling upgrade" [nodes]

That lets running jobs complete while draining the nodes; once idle, each node reboots into the new image and resumes itself after it comes back up and slurmd has started and checked in.
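One way to watch the rolling reboot progress (the output format string is just one reasonable choice):

```shell
# Show each node's name, compact state, and reason field.
# Nodes with a reboot queued carry an "@" suffix on their state
# (e.g. "mix@"); rebooted nodes return to idle/alloc on their own.
sinfo -N -o "%N %t %E"
```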

We use the same mechanism when we need to reboot nodes for other maintenance activities, for example when huge pages are too fragmented and the only way to reclaim them is to reboot the node (we run these checks in the node epilog).
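A toy sketch of the shape of such an epilog check — the threshold, the sysfs path, and the final `scontrol` call are illustrative assumptions, not our actual script:

```shell
#!/bin/sh
# Decide whether the free huge-page count has fallen below a floor.
needs_reboot() {
    free_pages=$1
    min_free=$2
    [ "$free_pages" -lt "$min_free" ]
}

# In a real epilog you would read the live value, e.g.:
#   free_hp=$(cat /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages)
# and on a hit queue the reboot (running jobs still finish first):
#   scontrol reboot ASAP nextstate=resume reason="Huge pages fragmented" "$(hostname -s)"
# Here the decision is demonstrated with fixed sample numbers.
if needs_reboot 12 64; then
    echo "reboot needed"
else
    echo "huge pages ok"
fi
```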

We paid for enhancements to Slurm 18.08 to ensure that slurmctld takes these node states into account when scheduling jobs, so that large jobs (as in requiring most of the nodes in the system) do not lose their scheduling window when a node has to be rebooted for this reason.

All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
