On 22-10-2024 16:46, Paul Raines via slurm-users wrote:
I have a cron job that emails me when hosts go into drain mode and
tells me the reason (scontrol show node=$host | grep -i reason)
Instead of cron you can also use Slurm triggers; see for example our
scripts on the page
https://github.c
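
As a rough illustration of the trigger approach (this is not the script from the page above, just a minimal sketch): Slurm runs a trigger program with the affected node name(s) as its first argument, so the program can look up the drain reason the same way the cron job does. The script name notify_drain.sh, its install path, and the address admin@example.com are made up for the example.

```shell
#!/bin/sh
# Sketch of a trigger program (hypothetical name: notify_drain.sh).
# Slurm passes the affected node name(s) as the first argument.

# Print the text after "Reason=" from `scontrol show node` output on stdin.
extract_reason() {
    sed -n 's/^[[:space:]]*Reason=//p'
}

# Print one "host: reason" line per node in the hostlist expression "$1".
drain_report() {
    # `scontrol show hostnames` expands e.g. node[01-03] into one name per line.
    for host in $(scontrol show hostnames "$1"); do
        printf '%s: %s\n' "$host" "$(scontrol show node="$host" | extract_reason)"
    done
}

# In the real script the report would then be mailed, for example:
#   drain_report "$1" | mail -s "Slurm nodes drained" admin@example.com
#
# Registering it as a permanent trigger (typically done as the SlurmUser):
#   strigger --set --node --drain --flags=PERM \
#            --program=/usr/local/sbin/notify_drain.sh
```

Without --flags=PERM the trigger is removed after it fires once and would have to be set again by the program itself.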
I have a cron job that emails me when hosts go into drain mode and
tells me the reason (scontrol show node=$host | grep -i reason)
We get drains with the "Kill task failed" reason about 5 times a
day, despite having UnkillableStepTimeout=180.
Right now we are still handling the