Hey guys,

When a job's maximum time is exceeded, Slurm tries to kill the job and fails:
[2019-03-15T09:44:03.589] sched: _slurm_rpc_allocate_resources JobId=1325 NodeList=rn003 usec=355
[2019-03-15T09:44:03.928] prolog_running_decr: Configuration for JobID=1325 is complete
[2019-03-15T09:45:12.739] Time limit exhausted for JobId=1325
[2019-03-15T09:45:44.001] _slurm_rpc_complete_job_allocation: JobID=1325 State=0x8006 NodeCnt=1 error Job/step already completing or completed
[2019-03-15T09:46:12.805] Resending TERMINATE_JOB request JobId=1325 Nodelist=rn003
[2019-03-15T09:48:43.000] update_node: node rn003 reason set to: Kill task failed
[2019-03-15T09:48:43.000] update_node: node rn003 state set to DRAINING
[2019-03-15T09:48:43.000] got (nil)
[2019-03-15T09:48:43.816] cleanup_completing: job 1325 completion process took 211 seconds

This happens even on very simple "srun bash" jobs that exceed their time limits. Do you have any idea what this means? Upgrading to the latest version did not help.

Best regards,
Taras
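P.S. For reference, the kind of trivial job that triggers it; the one-minute limit here is just an example, any job that outlives its limit behaves the same:

    # request a 1-minute time limit, then sleep past it
    srun --time=00:01:00 bash -c 'sleep 300'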