Hi Reed,

Thank you for that information. I gave the requeue a try; however, it did not 
work, as the scheduler did not recognize the job ID:
# scontrol requeue 8853669_3
8853669_3: Invalid job id specified

I tried with a few other job steps but saw the same error. It looks like the 
scheduler is not in agreement with the database over this batch of jobs, which 
is odd. Unfortunately, a restart of the daemons did not do the trick either.
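
For reference, the mismatch is visible when comparing the controller's view 
against the accounting view; something along these lines (using the same job 
IDs as below) should show it:
# squeue -j 8853669 --states=all
# sacct -j 8853669_3 -X --format=jobid,state,elapsed
squeue returns nothing for the array, while sacct still reports the task as 
FAILED with the elapsed clock running.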

From: Reed Dier <reed.d...@focusvq.com>
Date: Friday, February 3, 2023 at 1:08 PM
To: Jonathan Casco <jca...@fiu.edu>
Cc: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Job continuing to use cpu minutes after completion
This sounds similar to something I recently experienced and finally figured out 
in 21.08.

https://lists.schedmd.com/pipermail/slurm-users/2023-January/009594.html

The long and short of it is that I had jobs with the clock still running, even 
though they weren’t showing up in squeue, etc.
I ended up requeueing the jobs and then cancelling them, and they finally fell 
off the ledger.
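
In our case the sequence was roughly the following (substituting your job ID 
as an example):
# scontrol requeue 8853669_3
# scancel 8853669_3
The requeue puts the phantom job back into a state the controller knows about, 
and the subsequent scancel then lets accounting record a proper end time.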

Hope that’s helpful,
Reed


On Feb 3, 2023, at 9:17 AM, Jonathan Casco <jca...@fiu.edu> wrote:

Hello,

We are using Slurm 22.05.6 and have encountered a strange issue with one user's 
jobs, where they submitted a job array. According to the logs, the jobs failed 
and left the queue, but they have continued to use CPU minutes well past 
completion. I am using one step as an example here, but this is occurring for 
all the steps within the job array.

Below is a snippet from the slurmctld log for one of the job steps in question:
[2023-01-25T08:36:40.299] sched/backfill: _start_job: Started 
JobId=8853669_3(8853785) in <partition> on <node>
[2023-01-25T08:36:40.599] _job_complete: JobId=8853669_3(8853785) WEXITSTATUS 1
[2023-01-25T08:36:40.601] _job_complete: JobId=8853669_3(8853785) done

However, when checking the job with sacct, I see that the end time is Unknown 
and the job shows as never having completed:
# sacct -j 8853669_3 --format=start%15,end%15,elapsed%20,state%15
          Start             End              Elapsed           State
--------------- --------------- -------------------- ---------------
2023-01-25T08:3         Unknown           9-01:22:21          FAILED
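
For completeness, the raw JobId from the log line above (8853785) and the 
duplicates view can be queried the same way, e.g.:
# sacct -j 8853785 --format=start%15,end%15,elapsed%20,state%15
# sacct -j 8853669_3 --duplicates --format=start%15,end%15,elapsed%20,state%15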

One curious detail is that the job ID does not appear in the logs of the node 
where it is said to have run.

An scancel of the job has no effect, and we see the following in the logs when 
attempting it:
[2023-02-03T08:44:36.072] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=8853669_3 
uid <id>
[2023-02-03T08:44:36.073] job_str_signal(5): invalid JobId=8853669_3
[2023-02-03T08:44:36.073] _slurm_rpc_kill_job: job_str_signal() uid=<id> 
JobId=8853669_3 sig=9 returned: Invalid job id specified
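
Since the log shows the raw JobId in parentheses (8853669_3(8853785)), 
signalling that ID directly may be another avenue, e.g.:
# scancel 8853785
# scontrol show job 8853785
though given the controller does not recognize the array notation, I do not 
expect a different result.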

Checking the database, everything looks correct there for the job:
> select time_start,time_end from job_table where id_job="8853669_3";
+------------+------------+
| time_start | time_end   |
+------------+------------+
| 1674653930 | 1674653931 |
+------------+------------+
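
For readability, the epoch values can be decoded in the query itself, e.g. with 
MySQL/MariaDB's FROM_UNIXTIME:
> select from_unixtime(time_start),from_unixtime(time_end) from job_table where id_job="8853669_3";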

Both slurmctld and slurmdbd are running, so I am at a bit of a loss on how to 
get this job to “end” from the controller's point of view so that it stops 
consuming CPU minutes.
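
For what it is worth, the ongoing accrual can be watched from the association 
side with something like the following (sshare -l prints RawUsage; <user> is a 
placeholder):
# sshare -l -u <user>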

Any help would be appreciated, thanks!
