Just to follow up with some things I’ve tried: scancel doesn’t want to touch it:

> # scancel -v 290710
> scancel: Terminating job 290710
> scancel: error: Kill job error on job id 290710: Job/step already completing or completed
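As a sanity check on where the mismatch lives: squeue no longer knows anything about these job IDs, while sacct still reports the 2023 end times, so this looks like it is purely a slurmdbd record problem rather than anything slurmctld is still tracking. I’m retyping these from memory rather than pasting output, so treat them as the general shape of the check:

  squeue -j 290710
  sacct -j 290710 -X -o JobID,State,Start,End,Elapsed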
scontrol does see that these are all members of the same array, but doesn’t want to touch it:

> # scontrol update JobID=290710 EndTime=2022-08-09T08:47:01
> 290710_4,6,26,32,60,67,83,87,89,91,...: Job has already finished

And trying to modify the job’s end time with sacctmgr fails as expected, because EndTime is only a where spec, not a set spec (I also tried EndTime=now with the same result):

> # sacctmgr modify job where JobID=290710 set EndTime=2022-08-09T08:47:01
> Unknown option: EndTime=2022-08-09T08:47:01
> Use keyword 'where' to modify condition
> You didn't give me anything to set

I was able to set a comment for the jobs/array, so the DBD can see/talk to them.

One additional thing to mention: there are 14 JIDs stuck like this, 1 of which is the array JID itself, and 13 of which are array tasks on that original array ID.

But I figured I would provide some of the other steps I’ve tried, to flush out those ideas. The only remaining idea I have is editing the records directly in the accounting database, which I’ve sketched at the very bottom, below the quoted thread.

Thanks,
Reed

> On Dec 20, 2022, at 10:08 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>
> 2 votes for runawayjobs is a strong vote (and also something I’m glad to learn exists for the future), however, it does not appear to be the case.
>
>> # sacctmgr show runawayjobs
>> Runaway Jobs: No runaway jobs found on cluster $cluster
>
> So unfortunately that doesn’t appear to be the culprit.
>
> Appreciate the responses.
>
> Reed
>
>> On Dec 20, 2022, at 10:03 AM, Brian Andrus <toomuc...@gmail.com> wrote:
>>
>> Try:
>>
>> sacctmgr list runawayjobs
>>
>> Brian Andrus
>>
>> On 12/20/2022 7:54 AM, Reed Dier wrote:
>>> Hoping this is a fairly simple one.
>>>
>>> This is a small internal cluster that we’ve been using for about 6 months now, and we’ve had some infrastructure instability in that time, which I think may be the root culprit behind this weirdness, but hopefully someone can point me in the direction to solve the issue.
>>>
>>> I do a daily email of sreport to show how busy the cluster was, and who were the top users.
>>> Weirdly, I have a user that seems to show the exact same usage day after day after day, down to the hundredth of a percent, conspicuously even when they were on vacation and claimed that they didn’t have job submissions in cron/etc.
>>>
>>> So then, taking a spin of the scom tui posted this morning (https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html), I filtered on that user, and noticed that even though I was only looking 2 days back at job history, I was seeing a job from August.
>>>
>>> Conspicuously, the job state is cancelled, but the job end time is 1y from the start time, meaning its job end time is in 2023.
>>> So something with the dbd is confused about this/these jobs that are lingering and reporting cancelled but still “on the books” somehow until next August.
>>>
>>>> Job ID          : 290742
>>>> Job Name        : $jobname
>>>> User            : $user
>>>> Group           : $user
>>>> Job Account     : $account
>>>> Job Submission  : 2022-08-08 08:44:52 -0400 EDT
>>>> Job Start       : 2022-08-08 08:46:53 -0400 EDT
>>>> Job End         : 2023-08-08 08:47:01 -0400 EDT
>>>> Job Wait time   : 2m1s
>>>> Job Run time    : 8760h0m8s
>>>> Partition       : $part
>>>> Priority        : 127282
>>>> QoS             : $qos
>>>> Steps count: 0
>>>
>>>> Filter: $user    Items: 13
>>>>
>>>> Job ID    Job Name    Part.   QoS     Account   User     Nodes    State
>>>> ────────────────────────────────────────────────────────────────────────
>>>> 290714    $jobname    $part   $qos    $acct     $user    node32   CANCELLED
>>>> 290716    $jobname    $part   $qos    $acct     $user    node24   CANCELLED
>>>> 290736    $jobname    $part   $qos    $acct     $user    node00   CANCELLED
>>>> 290742    $jobname    $part   $qos    $acct     $user    node01   CANCELLED
>>>> 290770    $jobname    $part   $qos    $acct     $user    node02   CANCELLED
>>>> 290777    $jobname    $part   $qos    $acct     $user    node03   CANCELLED
>>>> 290793    $jobname    $part   $qos    $acct     $user    node04   CANCELLED
>>>> 290797    $jobname    $part   $qos    $acct     $user    node05   CANCELLED
>>>> 290799    $jobname    $part   $qos    $acct     $user    node06   CANCELLED
>>>> 290801    $jobname    $part   $qos    $acct     $user    node07   CANCELLED
>>>> 290814    $jobname    $part   $qos    $acct     $user    node08   CANCELLED
>>>> 290817    $jobname    $part   $qos    $acct     $user    node09   CANCELLED
>>>> 290819    $jobname    $part   $qos    $acct     $user    node10   CANCELLED
>>>
>>> I’d love to figure out the proper way to either purge these jid’s from the accounting database cleanly, or change the job end/run time to a sane/correct value.
>>> Slurm is v21.08.8-2, and ntp is a stratum 1 server, so time is in sync everywhere, not that multiple servers would drift 1 year off like this.
>>>
>>> Thanks for any help,
>>> Reed
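P.S. For anyone who finds this thread later: the only remaining idea I have is to fix the records directly in the accounting database. I have not actually done this, so treat it purely as a sketch; it assumes the stock MySQL/MariaDB slurmdbd backend, the default slurm_acct_db database name, and the usual ${cluster}_job_table schema where time_start/time_end are stored as unix timestamps (and obviously take a dump of the database, and probably stop slurmdbd, before touching anything). First look at what is actually stored for the array:

  mysql slurm_acct_db -e "SELECT id_job, id_array_job, state, FROM_UNIXTIME(time_start) AS start_time, FROM_UNIXTIME(time_end) AS end_time FROM ${cluster}_job_table WHERE id_array_job = 290710;"

and then, if the only wrong thing really is the year-in-the-future end time, clamp it back to something sane, e.g. the recorded start time:

  mysql slurm_acct_db -e "UPDATE ${cluster}_job_table SET time_end = time_start WHERE id_array_job = 290710 AND time_end > UNIX_TIMESTAMP();"

I’m also not sure whether the daily sreport numbers would correct themselves after that, since I believe sreport reads from slurmdbd’s usage rollup tables rather than the raw job records.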