On 22-10-2024 16:46, Paul Raines via slurm-users wrote:
I have a cron job that emails me when hosts go into drain mode and
tells me the reason (scontrol show node=$host | grep -i reason)

Instead of cron you can also use Slurm triggers; see, for example, our scripts at https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers
You can tailor the triggers to do whatever tasks you need.
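
As a rough sketch of what such a trigger program might look like: this is not one of the scripts from the repository above, and the function names and the default recipient are made up for illustration. It assumes a working `mail` command on the host running slurmctld, and that Slurm passes the affected node list as the first argument.

```shell
#!/bin/sh
# Hypothetical trigger program; Slurm is assumed to pass the node list as $1.

# Print the Reason= field from `scontrol show node` output read on stdin.
extract_reason() {
    sed -n 's/^[[:space:]]*Reason=//p'
}

# Mail the drain reason for each node in a node list expression.
# The recipient (default: root) is a placeholder.
notify_drained() {
    nodelist="$1"
    rcpt="${2:-root}"
    for host in $(scontrol show hostnames "$nodelist"); do
        scontrol show node "$host" | extract_reason |
            mail -s "Node $host drained" "$rcpt"
    done
}

# Only act when a node list was actually given on the command line.
if [ "$#" -gt 0 ]; then
    notify_drained "$1"
fi
```

It could then be registered once, normally as the SlurmUser, with something like `strigger --set --node --drained --flags=PERM --program=/usr/local/sbin/notify_drained.sh` (the path is hypothetical).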

We get drains with the "Kill task failed" reason probably about 5 times a day. This despite having UnkillableStepTimeout=180

Some time ago it was recommended that UnkillableStepTimeout values above 127 (or 256?) should not be used, see https://support.schedmd.com/show_bug.cgi?id=11103. I don't know whether this restriction still applies to recent versions of Slurm.
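
To see which value a cluster is actually running with, a small sketch along these lines might help. The function names are hypothetical, and the default limit of 127 is only the figure recalled above, not a confirmed restriction.

```shell
#!/bin/sh
# Print the numeric UnkillableStepTimeout from `scontrol show config`
# output read on stdin (the line looks like "UnkillableStepTimeout = 180 sec").
unkillable_timeout() {
    awk -F'=' '$1 ~ /^UnkillableStepTimeout[ \t]*$/ { print $2 + 0 }'
}

# Succeed if the value exceeds the limit.
# The default of 127 is an unconfirmed figure; pass another limit as $2.
exceeds_limit() {
    value="$1"
    limit="${2:-127}"
    [ "$value" -gt "$limit" ]
}

# Possible usage on a Slurm host:
#   t=$(scontrol show config | unkillable_timeout)
#   exceeds_limit "$t" && echo "UnkillableStepTimeout=$t is above the limit"
```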

Best regards,
Ole


