Hi -

This was a known bug: https://bugs.schedmd.com/show_bug.cgi?id=3941

However, the bug report says this was fixed in version 17.02.7.

The problem is that we're running version 17.11.2, yet we still appear to be hitting this bug:

[2023-04-18T17:09:42.482] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 163837 uid 38879
[2023-04-18T17:09:42.482] email msg to sim...@gmail.com: SLURM Job_id=163837 Name=clip_v3_1view_s3dis_mink_crop_075 Ended, Run time 00:37:37, CANCELLED, ExitCode 0
[2023-04-18T17:09:45.104] _slurm_rpc_submit_batch_job: JobId=163843 InitPrio=43243 usec=267
[2023-04-18T17:10:33.057] Resending TERMINATE_JOB request JobId=163837 Nodelist=dgx-4
[2023-04-18T17:10:48.244] error: slurmd error running JobId=163837 on node(s)=dgx-4: Kill task failed
[2023-04-18T17:10:48.244] drain_nodes: node dgx-4 state set to DRAIN
[2023-04-18T17:10:53.524] cleanup_completing: job 163837 completion process took 71 seconds


That particular node is still in a draining state a week later. Just wondering if I'm missing something.
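I know I can clear the drain by hand with scontrol once whatever is stuck on the node is gone (a minimal sketch, assuming standard sinfo/scontrol against the same controller; dgx-4 is the node from the log above):

  # check why the node was drained
  sinfo -R
  scontrol show node dgx-4

  # put the node back into service
  scontrol update NodeName=dgx-4 State=RESUME

but I'd rather understand why the kill keeps failing in the first place than keep resuming nodes manually.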
