This sounds similar to something I recently ran into, and finally figured out, 
on Slurm 21.08.

https://lists.schedmd.com/pipermail/slurm-users/2023-January/009594.html 

The long and short of it is that I had jobs with the clock still running, even 
though they weren’t showing up in squeue, etc.
I ended up requeueing the jobs and then cancelling them, and they finally fell 
off the ledger.
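
For anyone who wants to try the same thing, the requeue-then-cancel sequence 
can be sketched roughly as below. This is only a sketch: the function name 
clear_stuck_jobs and the DRY_RUN variable are made up for illustration, while 
scontrol requeue and scancel are the standard Slurm commands. It assumes you 
have operator/admin rights, and I can't promise the requeue will be accepted 
on 22.05 for a job the controller already reports as an invalid job ID.

```shell
#!/bin/sh
# Hypothetical helper: requeue each given job ID, then cancel it.
# With DRY_RUN unset or set to 1 (the default), the commands are only printed.
clear_stuck_jobs() {
    for jobid in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Show what would be run without touching the scheduler.
            echo "scontrol requeue $jobid"
            echo "scancel $jobid"
        else
            # Only cancel if the requeue was accepted by slurmctld.
            scontrol requeue "$jobid" && scancel "$jobid"
        fi
    done
}

# Example (prints the commands only):
#   clear_stuck_jobs <jobid> [<jobid> ...]
```

Run it once as-is to check the job IDs, then set DRY_RUN=0 to actually issue 
the commands.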

Hope that’s helpful,
Reed 

> On Feb 3, 2023, at 9:17 AM, Jonathan Casco <jca...@fiu.edu> wrote:
> 
> Hello,
>  
> We are using Slurm 22.05.6 and have encountered a strange issue with one 
> user's job array. According to the logs, the jobs failed and left the 
> queue, but they have continued to accrue CPU minutes well past job 
> completion. I am using one array task as an example here, but this is 
> occurring for all the tasks within the job array.
>  
> Below is a snippet from the slurmctld log for one of the array tasks in 
> question:
> [2023-01-25T08:36:40.299] sched/backfill: _start_job: Started 
> JobId=8853669_3(8853785) in <partition> on <node>
> [2023-01-25T08:36:40.599] _job_complete: JobId=8853669_3(8853785) WEXITSTATUS 
> 1
> [2023-01-25T08:36:40.601] _job_complete: JobId=8853669_3(8853785) done
>  
> However, when checking the job with sacct, I see that the end time is 
> Unknown and the job shows as never completed.
> # sacct -j 8853669_3 --format=start%15,end%15,elapsed%20,state%15
>           Start             End              Elapsed           State 
> --------------- --------------- -------------------- --------------- 
> 2023-01-25T08:3         Unknown           9-01:22:21          FAILED 
>  
> One curious bit in this is that the job ID does not appear in the logs of the 
> node where it is said to have run.
>  
> An scancel of the job does not have an effect and we see the following in the 
> logs when attempting to do so:
> [2023-02-03T08:44:36.072] _slurm_rpc_kill_job: REQUEST_KILL_JOB 
> JobId=8853669_3 uid <id>
> [2023-02-03T08:44:36.073] job_str_signal(5): invalid JobId=8853669_3
> [2023-02-03T08:44:36.073] _slurm_rpc_kill_job: job_str_signal() uid=<id> 
> JobId=8853669_3 sig=9 returned: Invalid job id specified
>  
> Checking the database, everything looks correct there for the job.
> > select time_start,time_end from job_table where id_job="8853669_3";
> +------------+------------+
> | time_start | time_end   |
> +------------+------------+
> | 1674653930 | 1674653931 |
> +------------+------------+
>  
> Both slurmctld and slurmdbd are running, so I am at a bit of a loss on how 
> to get the controller to treat this job as “ended” so that it stops 
> consuming CPU minutes.
>  
> Any help would be appreciated, thanks!
