Re: [slurm-users] Draining hosts because of failing jobs

Paul Edmon Tue, 04 May 2021 09:41:58 -0700

Since you can run an arbitrary script as a node health checker, I might add a script that counts failures and then closes the node if it hits a threshold. The script shouldn't need to talk to the slurmctld or slurmdbd, as it should be able to watch the log on the node and see the failures.
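
A rough sketch of what such a script could look like, to be hooked in via HealthCheckProgram in slurm.conf. This is untested against a real cluster. The log path, the threshold, and especially the JOB_END pattern are assumptions for illustration, not something taken from the slurmd source, so adjust them to whatever your slurmd actually logs when a job ends. Counting happens purely from the local log; only the drain itself goes through scontrol.

```python
#!/usr/bin/env python3
"""Sketch: drain this node after N consecutive failed jobs in the local slurmd log."""
import re
import socket
import subprocess

# ASSUMPTIONS (not from the original post): all three values are placeholders.
LOG_PATH = "/var/log/slurm/slurmd.log"   # assumed slurmd log location
THRESHOLD = 10                           # consecutive failures before draining
JOB_END = re.compile(r"job_rc = (\d+)")  # hypothetical: one match per finished job


def consecutive_failures(lines, job_end=JOB_END):
    """Return how many jobs have failed in a row at the end of the log."""
    streak = 0
    for line in lines:
        match = job_end.search(line)
        if match is None:
            continue  # not a job-completion line
        # A nonzero return code extends the streak, a success resets it.
        streak = streak + 1 if int(match.group(1)) != 0 else 0
    return streak


def main():
    try:
        with open(LOG_PATH, errors="replace") as fh:
            streak = consecutive_failures(fh)
    except OSError:
        return  # no readable log, nothing to decide
    if streak >= THRESHOLD:
        node = socket.gethostname().split(".")[0]
        # Mark the node as draining so no new jobs are scheduled on it.
        subprocess.run(
            ["scontrol", "update", f"NodeName={node}", "State=DRAIN",
             f"Reason={streak} consecutive failed jobs"],
            check=False,
        )


if __name__ == "__main__":
    main()
```

One caveat with this sketch: it rereads the whole log on every run, so after you resume the node it will drain again until a job succeeds or the log rotates. Remembering a file offset between runs would avoid that.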


-Paul Edmon-


On 5/4/2021 12:09 PM, Gerhard Strangar wrote:

Hello,

how do you implement something like "drain host after 10 consecutive
failed jobs"? Unlike a host check script, that checks for known errors,
I'd like to stop killing jobs just because one node is faulty.

Gerhard
