Slurm is trying to kill the job that is exceeding its time limit, but
the job doesn't die, so Slurm marks the node down because it sees this
as a problem with the node. Increasing the value for GraceTime or
KillWait might help:
*GraceTime*
Specifies, in units of seconds, the preemption
Hey guys,
When a job max time is exceeded, then Slurm tries to kill the job and fails:
[2019-03-15T09:44:03.589] sched: _slurm_rpc_allocate_resources JobId=1325
NodeList=rn003 usec=355
[2019-03-15T09:44:03.928] prolog_running_decr: Configuration for JobID=1325
is complete
[2019-03-15T09:45:12.739