On 8/3/22 11:47 am, Benjamin Arntzen wrote:
> At risk of being a heretic, why not something like Ansible to handle this?
Nothing heretical about that, but for us the reason is that `scontrol reboot ASAP` is integrated nicely into the scheduling of jobs. We have health checks and node epilogs that can recognise certain conditions that require a node reboot (too many fragmented huge pages, for instance) and can trigger one automatically without disrupting the scheduling of large jobs.
What used to happen was that when a node was rebooted, Slurm would consider it unavailable indefinitely, conclude that it couldn't schedule a large job on it, and pack in smaller jobs instead, pushing back the start time of the large job.
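
As a rough illustration of the health-check side of this, here is a minimal Python sketch of a check that requests an ASAP reboot when a node looks unhealthy. It is not our actual health check: the function names, the free huge page threshold and the reason string are all hypothetical, and how you detect fragmentation is site-specific. The only Slurm piece it relies on is `scontrol reboot ASAP` with the `nextstate=` and `reason=` options.

```python
import subprocess


def build_reboot_command(node, reason):
    """Build the argv for an ASAP reboot request for a single node.

    nextstate=RESUME asks Slurm to return the node to service once
    it has rebooted and registered again.
    """
    return [
        "scontrol", "reboot", "ASAP",
        "nextstate=RESUME",
        f"reason={reason}",
        node,
    ]


def needs_reboot(free_hugepages, min_free_hugepages):
    """Hypothetical health condition: too few usable huge pages left."""
    return free_hugepages < min_free_hugepages


def maybe_reboot(node, free_hugepages, min_free_hugepages, dry_run=True):
    """Request a reboot if the node fails the check.

    Returns the command that was (or would be) run, or None if the
    node is healthy. With dry_run=True nothing is executed.
    """
    if not needs_reboot(free_hugepages, min_free_hugepages):
        return None
    cmd = build_reboot_command(node, "hugepage_fragmentation")
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run with made-up numbers: 10 free huge pages, 64 wanted.
    print(maybe_reboot("node001", free_hugepages=10, min_free_hugepages=64))
```

Because the reboot goes through Slurm rather than an external tool, the scheduler knows the node is coming back and can keep planning the large job around it.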
All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA