[slurm-users] Re: Randomly draining nodes

2024-10-22 Thread Ole Holm Nielsen via slurm-users
On 22-10-2024 16:46, Paul Raines via slurm-users wrote: I have a cron job that emails me when hosts go into drain mode and tells me the reason (scontrol show node=$host | grep -i reason). Instead of cron, you can also use Slurm triggers; see for example our scripts on the page https://github.c
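
For readers unfamiliar with triggers, here is a minimal sketch of registering one with strigger so that a program runs whenever a node gets the DRAIN flag. This is not the script collection referred to above; the program path and the function names are hypothetical, and setting triggers usually has to be done as the SlurmUser.

```shell
#!/bin/sh
# Sketch only: the script path and function names below are hypothetical.

# Build a notification subject from the node list that Slurm passes
# to the trigger program as its argument(s).
drain_subject() {
    printf 'Nodes drained: %s\n' "$*"
}

# Register a permanent trigger that fires when any node gets the DRAIN flag.
# Without --flags=PERM the trigger is removed after it fires once.
register_drain_trigger() {
    strigger --set --node --drain --flags=PERM \
        --program=/usr/local/sbin/notify_drain   # hypothetical path
}
```

The trigger program itself could call something like drain_subject "$@" and pass the result to whatever mail command the site uses.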

[slurm-users] Re: Randomly draining nodes

2024-10-22 Thread Paul Raines via slurm-users
I have a cron job that emails me when hosts go into drain mode and tells me the reason (scontrol show node=$host | grep -i reason). We get drains with the "Kill task failed" reason probably about 5 times a day, despite having UnkillableStepTimeout=180. Right now we are still handling the
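
A cron check of this kind might look roughly like the sketch below. It is not the actual job described in the message: the function names are hypothetical, and it assumes sinfo and scontrol are on PATH and that grep supports -o.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not from the original post.

# Extract the "Reason=..." text from `scontrol show node` output on stdin.
extract_reason() {
    grep -i -o 'reason=.*'
}

# Print one "host: Reason=..." line per node that is drained or draining.
# sinfo -N lists a node once per partition, so duplicates are removed.
report_drained() {
    sinfo -h -N -t drain -o '%N' | sort -u | while read -r host; do
        reason=$(scontrol show node="$host" | extract_reason)
        printf '%s: %s\n' "$host" "$reason"
    done
}
```

From cron, the output of report_drained could be piped to the site's mail command whenever it is non-empty.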