Re: [slurm-users] Slurm cannot kill a job which time limit exhausted

Prentice Bisbal Tue, 19 Mar 2019 10:02:01 -0700

Slurm is trying to kill the job that is exceeding its time limit, but the job doesn't die, so Slurm marks the node down because it sees this as a problem with the node. Increasing the value for GraceTime or KillWait might help:

*GraceTime*
    Specifies, in units of seconds, the preemption grace time to be
    extended to a job which has been selected for preemption. The
    default value is zero, no preemption grace time is allowed on this
    partition. Once a job has been selected for preemption, its end
    time is set to the current time plus GraceTime. The job's tasks
    are immediately sent SIGCONT and SIGTERM signals in order to
    provide notification of its imminent termination. This is followed
    by the SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching
    its new end time. This second set of signals is sent to both the
    tasks *and* the containing batch script, if applicable. Meaningful
    only for PreemptMode=CANCEL. See also the global *KillWait*
    configuration parameter.

*KillWait*
    The interval, in seconds, given to a job's processes between the
    SIGTERM and SIGKILL signals upon reaching its time limit. If the
    job fails to terminate gracefully in the interval specified, it
    will be forcibly terminated. The default value is 30 seconds. The
    value may not exceed 65533.
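
As a rough illustration only, the two settings might appear in slurm.conf along the lines below. The values and the partition name "batch" are made up for the example and are not taken from the cluster in question; only the node name rn003 comes from the log quoted further down. KillWait is a global parameter, while GraceTime is set per partition.

```
# slurm.conf (sketch; values are illustrative assumptions, not recommendations)

# Global: seconds allowed between SIGTERM and SIGKILL at the time limit (default 30)
KillWait=60

# Per partition: preemption grace time in seconds (default 0).
# "batch" is a hypothetical partition name.
PartitionName=batch Nodes=rn003 GraceTime=60
```

After editing the file, the daemons would normally need to reread the configuration, for example with "scontrol reconfigure".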



--
Prentice


On 3/19/19 7:21 AM, Taras Shapovalov wrote:

Hey guys,
When a job's maximum time is exceeded, Slurm tries to kill the job and fails:
[2019-03-15T09:44:03.589] sched: _slurm_rpc_allocate_resources JobId=1325 NodeList=rn003 usec=355
[2019-03-15T09:44:03.928] prolog_running_decr: Configuration for JobID=1325 is complete
[2019-03-15T09:45:12.739] Time limit exhausted for JobId=1325
[2019-03-15T09:45:44.001] _slurm_rpc_complete_job_allocation: JobID=1325 State=0x8006 NodeCnt=1 error Job/step already completing or completed
[2019-03-15T09:46:12.805] Resending TERMINATE_JOB request JobId=1325 Nodelist=rn003
[2019-03-15T09:48:43.000] update_node: node rn003 reason set to: Kill task failed
[2019-03-15T09:48:43.000] update_node: node rn003 state set to DRAINING
[2019-03-15T09:48:43.000] got (nil)
[2019-03-15T09:48:43.816] cleanup_completing: job 1325 completion process took 211 seconds
This happens even on very simple "srun bash" jobs that exceed their time limits. Do you have any idea what this means? Upgrading to the latest version did not help.
Best regards,

Taras
