Hello, I've set up a few nodes on Slurm to test with and am having trouble. It seems that once a job has hit its wall time, the node it ran on enters the completing (COMP) state and then remains in the drain state until I manually set its state to RESUME.
Looking at the slurmctld log on the head node, I see the following relevant entries:

[2019-03-15T09:45:12.739] Time limit exhausted for JobId=1325
[2019-03-15T09:45:44.001] _slurm_rpc_complete_job_allocation: JobID=1325 State=0x8006 NodeCnt=1 error Job/step already completing or completed
[2019-03-15T09:46:12.805] Resending TERMINATE_JOB request JobId=1325 Nodelist=rn003
[2019-03-15T09:48:43.000] update_node: node rn003 reason set to: Kill task failed
[2019-03-15T09:48:43.000] update_node: node rn003 state set to DRAINING
[2019-03-15T09:48:43.000] got (nil)
[2019-03-15T09:48:43.816] cleanup_completing: job 1325 completion process took 211 seconds

It may be worth mentioning that if I run a job as root and the job hits its wall time, the job is killed and the node returns to idle. But if the job is submitted as a non-root user, which is the case in our normal workflow, the node becomes drained once the wall time is reached.

Thank you,

--
Eric Rosenberg
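P.S. For reference, this is roughly how I've been clearing the node by hand each time (node name taken from the log above; adjust for your cluster):

```shell
# Node that drained after the job hit its wall time (from the log above).
NODE=rn003

# Show the node's current state and the drain reason
# (%N = nodelist, %t = state, %E = reason).
sinfo -n "$NODE" -o "%N %t %E"

# Clear the drain and return the node to service.
scontrol update NodeName="$NODE" State=RESUME
```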