Hi -

This was a known bug: https://bugs.schedmd.com/show_bug.cgi?id=3941

However, the bug report says this was fixed in version 17.02.7.

The problem is that we're running version 17.11.2, yet we still appear to be hitting this bug:

[2023-04-18T17:09:42.482] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 163837 uid 38879
[2023-04-18T17:09:42.482] email msg to sim...@gmail.com: SLURM Job_id=163837 Name=clip_v3_1view_s3dis_mink_crop_075 Ended, Run time 00:37:37, CANCELLED, ExitCode 0
[2023-04-18T17:09:45.104] _slurm_rpc_submit_batch_job: JobId=163843 InitPrio=43243 usec=267
[2023-04-18T17:10:33.057] Resending TERMINATE_JOB request JobId=163837 Nodelist=dgx-4
[2023-04-18T17:10:48.244] error: slurmd error running JobId=163837 on node(s)=dgx-4: Kill task failed
[2023-04-18T17:10:48.244] drain_nodes: node dgx-4 state set to DRAIN
[2023-04-18T17:10:53.524] cleanup_completing: job 163837 completion process took 71 seconds


That particular node is still in a draining state a week later. Just wondering if I'm missing something.
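I know I can clear the drain by hand with scontrol once whatever is stuck on the node is gone (a minimal sketch, assuming standard sinfo/scontrol against the same controller; dgx-4 is the node from the log above):

  # check why the node was drained
  sinfo -R
  scontrol show node dgx-4

  # put the node back into service
  scontrol update NodeName=dgx-4 State=RESUME

but I'd rather understand why the kill keeps failing in the first place than keep resuming nodes manually.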
