Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

Christopher Samuel Wed, 03 Aug 2022 22:12:56 -0700

On 8/3/22 11:47 am, Benjamin Arntzen wrote:

> At risk of being a heretic, why not something like Ansible to handle this?

Nothing heretical about that, but for us the reason is that `scontrol reboot ASAP` is integrated nicely into the scheduling of jobs, we have health checks and node epilogs that can recognise certain conditions that require a node reboot (too many fragmented huge pages for instance) and can trigger that automatically without it disrupting scheduling of large jobs.

What used to happen was that when a node was rebooted, Slurm would consider it indefinitely unavailable and so think it couldn't schedule a large job and instead pack in smaller jobs, pushing back the start time of the large job.
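
For illustration only, here is a minimal sketch of how a health check or epilog might request such a reboot. It is not the actual site script: the function name `request_reboot`, the `DRY_RUN` variable, the node name, and the reason string are all hypothetical. Only `scontrol reboot ASAP` comes from the message above; `nextstate=RESUME` and `reason=` are assumed here as the usual companion options of `scontrol reboot`.

```shell
#!/bin/sh
# Hypothetical helper: ask Slurm for a scheduler-aware reboot of one node.
# DRY_RUN defaults to 1, so the command is only printed unless DRY_RUN=0.
request_reboot() {
    node="$1"      # node name, e.g. node001 (illustrative)
    reason="$2"    # free-text reason recorded against the node

    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Show what would be run without touching the cluster.
        printf 'scontrol reboot ASAP nextstate=RESUME reason="%s" %s\n' \
            "$reason" "$node"
    else
        # ASAP: reboot once the node is idle, rather than immediately.
        # nextstate=RESUME: return the node to service after the reboot.
        scontrol reboot ASAP nextstate=RESUME reason="$reason" "$node"
    fi
}

# Example: a health check that has detected a reboot-worthy condition.
request_reboot node001 "fragmented huge pages"
```

Run as-is, this only prints the command it would issue; setting `DRY_RUN=0` on a host with Slurm admin rights would actually submit the reboot request.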


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
