Hello,

We are using Slurm 22.05.6 and have encountered a strange issue with one
user's jobs, which were submitted as a job array. According to the logs, the
jobs failed and left the queue, but they have continued to accrue CPU minutes
well past job completion. I am using one array task as an example here, but
this is occurring for all the tasks within the job array.

Below is a snippet from the slurmctld log for one of the array tasks in question:
[2023-01-25T08:36:40.299] sched/backfill: _start_job: Started JobId=8853669_3(8853785) in <partition> on <node>
[2023-01-25T08:36:40.599] _job_complete: JobId=8853669_3(8853785) WEXITSTATUS 1
[2023-01-25T08:36:40.601] _job_complete: JobId=8853669_3(8853785) done

However, when checking the job with sacct, I see that the end time is Unknown
and the elapsed time is still growing, as if the job had never completed:
# sacct -j 8853669_3 --format=start%15,end%15,elapsed%20,state%15
          Start             End              Elapsed           State
--------------- --------------- -------------------- ---------------
2023-01-25T08:3         Unknown           9-01:22:21          FAILED

One curious detail is that the job ID does not appear in the logs of the node
where it is said to have run.

Running scancel on the job has no effect, and we see the following in the
logs when attempting to do so:
[2023-02-03T08:44:36.072] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=8853669_3 uid <id>
[2023-02-03T08:44:36.073] job_str_signal(5): invalid JobId=8853669_3
[2023-02-03T08:44:36.073] _slurm_rpc_kill_job: job_str_signal() uid=<id> JobId=8853669_3 sig=9 returned: Invalid job id specified

Checking the database, everything looks correct there for the job:
> select time_start,time_end from job_table where id_job="8853669_3";
+------------+------------+
| time_start | time_end   |
+------------+------------+
| 1674653930 | 1674653931 |
+------------+------------+
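
To compare these epoch values against the log timestamps, they can be
converted to readable times. This is only a minimal Python sketch: the helper
name epoch_to_utc is made up for illustration, and the output is in UTC
because the cluster's time zone is not stated here.

```python
from datetime import datetime, timezone


def epoch_to_utc(ts: int) -> str:
    """Return an ISO 8601 UTC timestamp for a Unix epoch value."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


# Values copied from the job_table query output above.
time_start = 1674653930
time_end = 1674653931

print("start:", epoch_to_utc(time_start))
print("end:  ", epoch_to_utc(time_end))
print("elapsed seconds:", time_end - time_start)
```

The start value converts to 2023-01-25T13:38:50 UTC, and the two values are
one second apart.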

Both slurmctld and slurmdbd are running, so I am at a bit of a loss as to how
to get the controller to treat this job as ended so that it stops consuming
CPU minutes.

Any help would be appreciated, thanks!
