Adrian Sevcenco <adrian.sevce...@spacescience.ro> writes:

> Having just implemented some triggers, I just noticed this:
>
> NODELIST   NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> alien-0-47 1     alien*    draining 48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
> alien-0-56 1     alien*    drained  48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
>
> I was wondering why a node is drained when killing of a task fails
I guess the heuristic is that something is wrong with the node, so it
should not run more jobs; for example, processes stuck in disk wait or
similar problems that might require a reboot.

> and how can i disable it? (i use cgroups)

I don't know how to disable it, but it can be tuned with:

  UnkillableStepTimeout
          The length of time, in seconds, that Slurm will wait before
          deciding that processes in a job step are unkillable (after
          they have been signaled with SIGKILL) and execute
          UnkillableStepProgram.  The default timeout value is 60
          seconds.  If exceeded, the compute node will be drained to
          prevent future jobs from being scheduled on the node.

(Note, though, that according to
https://bugs.schedmd.com/show_bug.cgi?id=11103 it should not be set
higher than 127 s.)

You might also want to look at this setting to find out what is going
on on the machine when Slurm cannot kill the job step:

  UnkillableStepProgram
          If the processes in a job step are determined to be
          unkillable for a period of time specified by the
          UnkillableStepTimeout variable, the program specified by
          UnkillableStepProgram will be executed.  By default no
          program is run.  See section UNKILLABLE STEP PROGRAM SCRIPT
          for more information.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
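To illustrate what such a program might do, here is a minimal sketch of
an UnkillableStepProgram script. It only records which processes are
stuck in uninterruptible sleep so you can diagnose the "Kill task
failed" drain reason afterwards. The log location and the use of the
SLURM_JOB_ID environment variable are assumptions for illustration, not
something the man page excerpt above guarantees:

```shell
#!/bin/sh
# Hypothetical UnkillableStepProgram sketch: when Slurm declares a job
# step unkillable, record the processes that ignored SIGKILL.
# NOTE (assumptions): the SLURM_JOB_ID variable and the log path are
# illustrative; in production you would likely log under /var/log/slurm.

LOG="${TMPDIR:-/tmp}/unkillable-${SLURM_JOB_ID:-unknown}.log"

{
    date
    echo "Unkillable step on $(hostname): job=${SLURM_JOB_ID:-?}"
    # Processes in uninterruptible sleep ("D" state, often stuck in
    # NFS or disk I/O) are the usual reason SIGKILL has no effect;
    # keep the ps header line plus any D-state processes.
    ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
} >> "$LOG"
```

You would then point slurmd at it with something like
UnkillableStepProgram=/usr/local/sbin/unkillable.sh in slurm.conf (path
chosen here just as an example), alongside your UnkillableStepTimeout
setting.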