[slurm-users] Re: Randomly draining nodes

2024-10-22 Thread Ole Holm Nielsen via slurm-users

On 22-10-2024 16:46, Paul Raines via slurm-users wrote:

I have a cron job that emails me when hosts go into drain mode and
tells me the reason (scontrol show node=$host | grep -i reason)


Instead of cron you can also use Slurm triggers; see for example our
scripts on the page
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers

You can tailor the triggers to do whatever tasks you need.
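
As a rough illustration of the trigger approach (this is not one of the scripts from the repository above), a trigger program could look like the sketch below. It assumes the program is registered with something like `strigger --set --node --drained --flags=PERM --program=/usr/local/sbin/drain_report.py` and that Slurm passes the affected nodes to the program as a hostlist argument. The script path and the names `node_reason` and `build_report` are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical trigger program: report why nodes were drained."""
import subprocess
import sys


def node_reason(node, run=subprocess.run):
    """Return the lines of `scontrol show node` output that mention a reason."""
    out = run(["scontrol", "show", "node", node],
              capture_output=True, text=True, check=True).stdout
    return [line.strip() for line in out.splitlines() if "reason" in line.lower()]


def build_report(node_expr, run=subprocess.run):
    """Expand the hostlist expression and collect the drain reason per node."""
    names = run(["scontrol", "show", "hostnames", node_expr],
                capture_output=True, text=True, check=True).stdout.split()
    lines = []
    for name in names:
        # Fall back to a placeholder when scontrol reports no reason.
        for reason in node_reason(name, run) or ["(no reason recorded)"]:
            lines.append(f"{name}: {reason}")
    return "\n".join(lines)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: the node list arrives as the first argument.
    print(build_report(sys.argv[1]))
```

The `run` parameter exists only so the functions can be exercised without a live cluster; in production the default `subprocess.run` is used, and the printed report could be piped to a mailer.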

We get drains with the "Kill task failed" reason probably about 5 times a day.  This despite having UnkillableStepTimeout=180 


Some time ago it was recommended that UnkillableStepTimeout values above 
127 (or 256?) should not be used, see 
https://support.schedmd.com/show_bug.cgi?id=11103.  I don't know whether this
restriction still applies to recent versions of Slurm.


Best regards,
Ole



--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Randomly draining nodes

2024-10-22 Thread Paul Raines via slurm-users



I have a cron job that emails me when hosts go into drain mode and
tells me the reason (scontrol show node=$host | grep -i reason)

We get drains with the "Kill task failed" reason probably about 5 times a 
day.  This despite having UnkillableStepTimeout=180
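
A cron-based check has to remember which drains it has already reported, or it will send the same mail on every run. The sketch below is not the actual cron job described above; it is a minimal illustration that assumes `sinfo -h -N -t drain -o "%N|%E"` is used to list drained nodes with their reasons and that a small JSON state file is acceptable. The function names and the state file are hypothetical.

```python
import json
import subprocess
from pathlib import Path


def parse_sinfo(text):
    """Parse 'node|reason' lines into a {node: reason} dict."""
    drained = {}
    for line in text.splitlines():
        node, sep, reason = line.partition("|")
        if sep:
            # With -N a node appears once per partition; the dict de-duplicates.
            drained[node.strip()] = reason.strip()
    return drained


def newly_drained(current, state_file):
    """Return entries not seen on the previous run, then save the new state."""
    path = Path(state_file)
    previous = json.loads(path.read_text()) if path.exists() else {}
    fresh = {n: r for n, r in current.items() if previous.get(n) != r}
    path.write_text(json.dumps(current))
    return fresh


def current_drains():
    """Ask sinfo for one 'node|reason' line per drained or draining node."""
    out = subprocess.run(["sinfo", "-h", "-N", "-t", "drain", "-o", "%N|%E"],
                         capture_output=True, text=True, check=True).stdout
    return parse_sinfo(out)
```

A cron entry would call `newly_drained(current_drains(), "/var/tmp/drains.json")` and send mail only when the result is non-empty.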


Right now we are still handling them manually by sshing to the node
and running a script we wrote called clean_cgroup_jobs, which looks
for the unkilled processes using the cgroup info for the job.

If it finds none, it deletes the cgroups for the job and we resume
the node.  This is the case about 95% of the time.
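
The clean_cgroup_jobs script itself is not shown in this thread. The following is only a guess at the general idea, assuming the job's cgroup directory is known and passed in (its location depends on the cgroup version and the Slurm configuration, so no path is hard-coded here). The names `job_pids` and `remove_if_empty` are hypothetical.

```python
import os


def job_pids(job_cgroup):
    """Collect every PID listed in cgroup.procs under a job's cgroup tree."""
    pids = set()
    for dirpath, _dirs, files in os.walk(job_cgroup):
        if "cgroup.procs" in files:
            with open(os.path.join(dirpath, "cgroup.procs")) as fh:
                pids.update(int(tok) for tok in fh.read().split())
    return sorted(pids)


def remove_if_empty(job_cgroup):
    """Remove the job's cgroup tree leaf-first, but only if it holds no PIDs.

    Returns the surviving PIDs (an empty list when the cgroups were removed).
    """
    pids = job_pids(job_cgroup)
    if not pids:
        for dirpath, _dirs, _files in os.walk(job_cgroup, topdown=False):
            # On cgroupfs, rmdir succeeds once a cgroup has no tasks or children.
            os.rmdir(dirpath)
    return pids
```

If `remove_if_empty` returns an empty list the node can be resumed; otherwise the returned PIDs are the ones to investigate.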

In the case of a truly unkillable process, it lists the process and then
we investigate manually.  Often in this case a hung NFS mount is causing
the problem, and we have various ways of dealing with that, which can involve
faking the IP of the offline NFS server on another server to make the node's
NFS client kernel process finally exit.


In the rare case where we cannot find a way to kill the unkillable process,
we arrange to reboot the node.


-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Tue, 22 Oct 2024 12:59am, Christopher Samuel via slurm-users wrote:



On 10/21/24 4:35 am, laddaoui--- via slurm-users wrote:


 It seems like there's an issue with the termination process on these
 nodes. Any thoughts on what could be causing this?


That usually means processes are wedged in the kernel for some reason, in an
uninterruptible sleep state. You can define an "UnkillableStepProgram" to be
run on the node when that happens to capture useful state info. For example,
it can iterate through the processes in the job's cgroup, dumping their
`/proc/$PID/stack` somewhere useful, get the `ps` info for all those same
processes, and/or do an `echo w > /proc/sysrq-trigger` to make the kernel
dump all blocked tasks.
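
A minimal sketch of what such a program might do is given below. It assumes the list of PIDs has already been gathered from the job's cgroup, that the program runs as root (reading `/proc/$PID/stack` and writing to `/proc/sysrq-trigger` both require it), and that sysrq is enabled on the node. The function names and the output layout are hypothetical.

```python
import os


def capture_state(pids, out_dir, proc_root="/proc"):
    """Copy <proc_root>/<pid>/stack and .../status for each PID into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for pid in pids:
        for name in ("stack", "status"):
            src = os.path.join(proc_root, str(pid), name)
            try:
                with open(src) as fh:
                    data = fh.read()
            except OSError:
                continue  # process already exited, or the file is not readable
            dst = os.path.join(out_dir, f"{pid}.{name}")
            with open(dst, "w") as fh:
                fh.write(data)
            saved.append(dst)
    return saved


def dump_blocked_tasks(trigger="/proc/sysrq-trigger"):
    """Ask the kernel to log all blocked tasks (same effect as `echo w`)."""
    with open(trigger, "w") as fh:
        fh.write("w")
```

The kernel writes the blocked-task dump to its log, so the output of `dump_blocked_tasks` has to be collected separately, for example from `dmesg`.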


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA









