On 8/3/22 11:47 am, Benjamin Arntzen wrote:

At risk of being a heretic, why not something like Ansible to handle this?

Nothing heretical about that, but for us the reason is that `scontrol reboot ASAP` is integrated nicely into the scheduling of jobs. We have health checks and node epilogs that can recognise certain conditions that require a node reboot (too many fragmented huge pages, for instance) and can trigger one automatically without disrupting the scheduling of large jobs.

What used to happen was that when a node was rebooted, Slurm would consider it indefinitely unavailable. It would then think it couldn't schedule a large job and would instead pack in smaller jobs, pushing back the start time of the large job.
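
For illustration, here is a rough sketch of what such a health check could look like. It is not our actual check: the function names, the `min_free` threshold, and the idea of estimating fragmentation from `/proc/buddyinfo` are all assumptions made for the example, as is the 4 KiB base page size (which makes order 9 a 2 MiB block). The only Slurm piece it relies on is the `scontrol reboot ASAP nextstate=RESUME reason=... <node>` form.

```python
import subprocess


def huge_page_capacity(buddyinfo_text, min_order=9):
    """Estimate how many 2 MiB huge pages could still be allocated.

    Each /proc/buddyinfo line lists free block counts per order (0, 1, 2, ...).
    A free block of order >= min_order can back 2**(order - min_order) huge
    pages. min_order=9 assumes 4 KiB base pages (hypothetical default).
    """
    total = 0
    for line in buddyinfo_text.splitlines():
        fields = line.split()
        # Expected shape: Node 0, zone Normal <count order 0> <count order 1> ...
        if len(fields) < 5 or fields[0] != "Node":
            continue
        counts = [int(c) for c in fields[4:]]
        for order, count in enumerate(counts):
            if order >= min_order:
                total += count * 2 ** (order - min_order)
    return total


def build_reboot_command(node, reason):
    """Build the scontrol call that reboots the node once it is idle."""
    return ["scontrol", "reboot", "ASAP", "nextstate=RESUME",
            f"reason={reason}", node]


def check_node(node, buddyinfo_text, min_free=16, dry_run=True):
    """Return the reboot command if the node looks too fragmented, else None.

    min_free is an illustrative threshold, not a recommended value.
    With dry_run=False the command is actually executed.
    """
    if huge_page_capacity(buddyinfo_text) >= min_free:
        return None
    cmd = build_reboot_command(node, "fragmented huge pages")
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

In an epilog or health check script you would pass in the contents of `/proc/buddyinfo` and the local node name, and call it with `dry_run=False` once you trust the threshold.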

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

