[slurm-users] Re: Randomly draining nodes

2024-10-24 Thread Christopher Samuel via slurm-users
Hi Ole, On 10/22/24 11:04 am, Ole Holm Nielsen via slurm-users wrote: Some time ago it was recommended that UnkillableStepTimeout values above 127 (or 256?) should not be used, see https://support.schedmd.com/show_bug.cgi?id=11103. I don't know if this restriction is still valid with recent
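
For reference, the timeout currently in effect can be read back from any node with scontrol's standard config output; a minimal check:

    # Show the UnkillableStepTimeout (in seconds) the running daemons use:
    scontrol show config | grep -i UnkillableStepTimeout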

[slurm-users] Re: Randomly draining nodes

2024-10-23 Thread Ole Holm Nielsen via slurm-users
Hi Chris, Thanks for confirming that UnkillableStepTimeout can have larger values without issues. Do you have some suggestions for values that would safely cover network filesystem delays? Best regards, Ole On 10/24/24 07:51, Christopher Samuel via slurm-users wrote: Some time ago it was re
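
A sketch of the kind of change under discussion; the 300-second figure is purely illustrative, not a value anyone in this thread recommends:

    # slurm.conf (illustrative value only)
    UnkillableStepTimeout=300

    # After copying the updated slurm.conf to every node:
    scontrol reconfigure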

[slurm-users] Re: Randomly draining nodes

2024-10-22 Thread Ole Holm Nielsen via slurm-users
On 22-10-2024 16:46, Paul Raines via slurm-users wrote: I have a cron job that emails me when hosts go into drain mode and tells me the reason (scontrol show node=$host | grep -i reason) Instead of cron you can also use Slurm triggers, see for example our scripts in the page https://github.c
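
For the cron-based variant, a minimal sketch using only standard Slurm commands (the mail address is a placeholder; the trigger scripts behind the truncated URL above are not reproduced here):

    #!/bin/bash
    # Mail a report of drained nodes and their Reason fields; run from cron.
    ADMIN_MAIL="root@localhost"        # placeholder address
    drained=$(sinfo -h -R -t drain)    # REASON USER TIMESTAMP NODELIST, one line per node
    if [ -n "$drained" ]; then
        echo "$drained" | mail -s "Drained nodes on $(hostname -s)" "$ADMIN_MAIL"
    fi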

[slurm-users] Re: Randomly draining nodes

2024-10-22 Thread Paul Raines via slurm-users
I have a cron job that emails me when hosts go into drain mode and tells me the reason (scontrol show node=$host | grep -i reason) We get drains with the "Kill task failed" reason probably about 5 times a day. This despite having UnkillableStepTimeout=180 Right now we are still handling the
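
Clearing such a drain by hand, once the stuck process has finally exited, is a one-liner; node001 is a placeholder name:

    # Return a drained node to service after the "Kill task failed" cause is gone:
    scontrol update NodeName=node001 State=RESUME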

[slurm-users] Re: Randomly draining nodes

2024-10-21 Thread Christopher Samuel via slurm-users
On 10/21/24 4:35 am, laddaoui--- via slurm-users wrote: It seems like there's an issue with the termination process on these nodes. Any thoughts on what could be causing this? That usually means processes wedged in the kernel for some reason, in an uninterruptible sleep state. You can define
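
A quick way to look for such wedged processes on the affected node, using plain ps (nothing Slurm-specific):

    # List processes in uninterruptible sleep (state D) and the kernel
    # function they are blocked in; NFS or other I/O waits show up here:
    ps -eo pid,state,wchan:32,comm | awk '$2 == "D"'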

[slurm-users] Re: Randomly draining nodes

2024-10-21 Thread laddaoui--- via slurm-users
You were right, I found that the slurm.conf file was different between the controller node and the computes, so I've synchronized it now. I was also considering setting up an epilogue script to help debug what happens after the job finishes. Do you happen to have any examples of what an epilogue
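
Not an example from this thread, but a minimal debugging Epilog sketch of the sort being asked about, wired in via Epilog= in slurm.conf; the log path is a placeholder, and SLURM_JOB_ID / SLURM_JOB_USER are set by slurmd in the Epilog environment:

    #!/bin/bash
    # Hypothetical Epilog: record any processes a job leaves behind,
    # since survivors are the usual suspects for "Kill task failed" drains.
    LOG=/var/log/slurm/epilog-debug.log    # placeholder path
    {
        echo "$(date '+%F %T') job=$SLURM_JOB_ID user=$SLURM_JOB_USER node=$(hostname -s)"
        ps -u "$SLURM_JOB_USER" -o pid,state,wchan:30,comm --no-headers
    } >> "$LOG" 2>&1
    # A non-zero Epilog exit status itself drains the node, so always:
    exit 0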

[slurm-users] Re: Randomly draining nodes

2024-10-15 Thread Laura Hild via slurm-users
Your slurm.conf should be the same on all machines (is it? you don't have Prolog configured on some but not others?), but no, it is not mandatory to use a prolog. I am simply surprised that you could get a "Prolog error" without having a prolog configured, since an error in the prolog program
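
One quick way to verify the copies match, assuming a parallel shell such as ClusterShell's clush is set up (pdsh works similarly); the config path is the common default:

    # Identical files fold into one line with -b; differing checksums
    # point at out-of-sync slurm.conf copies:
    clush -ab 'md5sum /etc/slurm/slurm.conf'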

[slurm-users] Re: Randomly draining nodes

2024-10-11 Thread laddaoui--- via slurm-users
Hi Laura, Thank you for your reply. Indeed, Prolog is not configured on my machine:

$ scontrol show config |grep -i prolog
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = Alloc,Contain
ResvProlog              = (null)
Sr

[slurm-users] Re: Randomly draining nodes

2024-10-08 Thread Laura Hild via slurm-users
Apologies if I'm missing this in your post, but do you in fact have a Prolog configured in your slurm.conf?