This sounds similar to something I recently experienced and finally figured out in 21.08.
https://lists.schedmd.com/pipermail/slurm-users/2023-January/009594.html

The long and short of it is that I had jobs with the clock running, even
though they weren't showing up in squeue, etc. I ended up requeueing the
jobs and then cancelling them, and they finally fell off the ledger.

Hope that's helpful,
Reed

> On Feb 3, 2023, at 9:17 AM, Jonathan Casco <jca...@fiu.edu> wrote:
>
> Hello,
>
> We are using Slurm 22.05.6 and have encountered a strange issue with one
> user's jobs where they submitted a job array. According to the logs, the
> jobs failed and left the queue, but they have continued to use CPU minutes
> well past job completion. I am using one step as an example here, but this
> is occurring for all the steps within the job array.
>
> Below is a snippet from the slurmctld log for one of the job steps in
> question:
> [2023-01-25T08:36:40.299] sched/backfill: _start_job: Started JobId=8853669_3(8853785) in <partition> on <node>
> [2023-01-25T08:36:40.599] _job_complete: JobId=8853669_3(8853785) WEXITSTATUS 1
> [2023-01-25T08:36:40.601] _job_complete: JobId=8853669_3(8853785) done
>
> However, when checking the job with sacct I see that the end time is
> Unknown and the job shows as never completed.
> # sacct -j 8853669_3 --format=start%15,end%15,elapsed%20,state%15
>           Start             End              Elapsed           State
> --------------- --------------- -------------------- ---------------
> 2023-01-25T08:3         Unknown           9-01:22:21          FAILED
>
> One curious bit in this is that the job ID does not appear in the logs of
> the node where it is said to have run.
>
> An scancel of the job has no effect, and we see the following in the logs
> when attempting it:
> [2023-02-03T08:44:36.072] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=8853669_3 uid <id>
> [2023-02-03T08:44:36.073] job_str_signal(5): invalid JobId=8853669_3
> [2023-02-03T08:44:36.073] _slurm_rpc_kill_job: job_str_signal() uid=<id> JobId=8853669_3 sig=9 returned: Invalid job id specified
>
> Checking the database, everything looks correct there for the job.
>
> select time_start,time_end from job_table where id_job="8853669_3";
> +------------+------------+
> | time_start | time_end   |
> +------------+------------+
> | 1674653930 | 1674653931 |
> +------------+------------+
>
> Both slurmctld and slurmdbd are running, so I am at a bit of a loss on how
> to proceed with getting this job to “end” to the controller so that it can
> stop consuming cpuminutes.
>
> Any help would be appreciated, thanks!
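
For what it's worth, the mismatch in the original report is easy to see by
putting the two sets of numbers side by side. Below is a minimal Python
sketch that uses only the values quoted above (the time_start/time_end pair
from the database query and the Elapsed column from sacct). The helper name
parse_elapsed is my own (hypothetical) and is not part of Slurm, and the
[D-]HH:MM:SS layout is assumed from the sacct output shown.

```python
from datetime import datetime, timezone


def parse_elapsed(text):
    """Convert an elapsed string of the form [D-]HH:MM:SS to seconds.

    Hypothetical helper for illustration; the format is assumed from the
    sacct output quoted in the message.
    """
    days, _, clock = text.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


# Values copied from the quoted message.
db_time_start = 1674653930      # job_table.time_start (Unix epoch seconds)
db_time_end = 1674653931        # job_table.time_end (Unix epoch seconds)
sacct_elapsed = "9-01:22:21"    # Elapsed column reported by sacct

db_runtime = db_time_end - db_time_start
sacct_runtime = parse_elapsed(sacct_elapsed)
start_utc = datetime.fromtimestamp(db_time_start, tz=timezone.utc)

print("database start (UTC):", start_utc.isoformat())
print("database runtime (s):", db_runtime)
print("sacct elapsed (s):   ", sacct_runtime)
```

The row returned by the query describes a job that ran for one second, while
sacct is still accumulating elapsed time for it, which is the discrepancy
behind the runaway CPU minutes.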