Re: [slurm-users] Job cancelled into the future

Reed Dier Tue, 20 Dec 2022 08:10:13 -0800

Two votes for runawayjobs is a strong vote (and also something I’m glad to learn 
exists for the future). However, that does not appear to be the case here.


> # sacctmgr show runawayjobs
> Runaway Jobs: No runaway jobs found on cluster $cluster

So unfortunately that doesn’t appear to be the culprit.

Appreciate the responses.

Reed

> On Dec 20, 2022, at 10:03 AM, Brian Andrus <toomuc...@gmail.com> wrote:
> 
> Try: 
> 
>     sacctmgr list runawayjobs
> 
> Brian Andrus
> 
> On 12/20/2022 7:54 AM, Reed Dier wrote:
>> Hoping this is a fairly simple one.
>> 
>> This is a small internal cluster that we’ve been using for about 6 months 
>> now, and we’ve had some infrastructure instability in that time, which I 
>> think may be the root culprit behind this weirdness, but hopefully someone 
>> can point me in the direction to solve the issue.
>> 
>> I do a daily email of sreport to show how busy the cluster was, and who were 
>> the top users.
>> Weirdly, I have a user that seems to show the exact same usage day after day 
>> after day, down to the hundredth of a percent, conspicuously even when they 
>> were on vacation and claimed that they didn’t have job submissions in 
>> cron/etc.
>> 
>> So then, taking a spin of the scom tui
>> <https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html>
>> posted this morning, I filtered on that user and noticed that even though I
>> was only looking 2 days back at job history, I was seeing a job from August.
>> 
>> Conspicuously, the job state is cancelled, but the job end time is 1y from 
>> the start time, meaning its job end time is in 2023.
>> So something with the dbd is confused about this/these jobs that are 
>> lingering and reporting cancelled but still “on the books” somehow until 
>> next August.
>> 
>>> ╭────────────────────────────────────────────────────────╮
>>> │                                                        │
>>> │  Job ID               : 290742                         │
>>> │  Job Name             : $jobname                       │
>>> │  User                 : $user                          │
>>> │  Group                : $user                          │
>>> │  Job Account          : $account                       │
>>> │  Job Submission       : 2022-08-08 08:44:52 -0400 EDT  │
>>> │  Job Start            : 2022-08-08 08:46:53 -0400 EDT  │
>>> │  Job End              : 2023-08-08 08:47:01 -0400 EDT  │
>>> │  Job Wait time        : 2m1s                           │
>>> │  Job Run time         : 8760h0m8s                      │
>>> │  Partition            : $part                          │
>>> │  Priority             : 127282                         │
>>> │  QoS                  : $qos                           │
>>> │                                                        │
>>> │                                                        │
>>> ╰────────────────────────────────────────────────────────╯
>>> Steps count: 0
>> 
>>> Filter: $user         Items: 13
>>> 
>>>  Job ID  Job Name  Part.  QoS   Account  User   Nodes   State
>>> ─────────────────────────────────────────────────────────────────
>>>  290714  $jobname  $part  $qos  $acct    $user  node32  CANCELLED
>>>  290716  $jobname  $part  $qos  $acct    $user  node24  CANCELLED
>>>  290736  $jobname  $part  $qos  $acct    $user  node00  CANCELLED
>>>  290742  $jobname  $part  $qos  $acct    $user  node01  CANCELLED
>>>  290770  $jobname  $part  $qos  $acct    $user  node02  CANCELLED
>>>  290777  $jobname  $part  $qos  $acct    $user  node03  CANCELLED
>>>  290793  $jobname  $part  $qos  $acct    $user  node04  CANCELLED
>>>  290797  $jobname  $part  $qos  $acct    $user  node05  CANCELLED
>>>  290799  $jobname  $part  $qos  $acct    $user  node06  CANCELLED
>>>  290801  $jobname  $part  $qos  $acct    $user  node07  CANCELLED
>>>  290814  $jobname  $part  $qos  $acct    $user  node08  CANCELLED
>>>  290817  $jobname  $part  $qos  $acct    $user  node09  CANCELLED
>>>  290819  $jobname  $part  $qos  $acct    $user  node10  CANCELLED
>> 
>> 
>> I’d love to figure out the proper way to either purge these job IDs from the 
>> accounting database cleanly, or change the job end/run time to a 
>> sane/correct value.
>> Slurm is v21.08.8-2, and ntp is a stratum 1 server, so time is in sync 
>> everywhere, not that multiple servers would drift 1 year off like this.
>> 
>> Thanks for any help,
>> Reed
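
The inconsistency described in the quoted message (a job in state CANCELLED whose end time is still in the future) can be checked with a little date arithmetic. The following is only a minimal sketch: the record layout, the `FINISHED_STATES` set, and the helper names `format_elapsed` and `ends_in_future` are assumptions made for illustration, not part of Slurm or scom. The timestamps are the ones shown for job 290742 above.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-4 offset, matching the "-0400 EDT" timestamps in the quoted output.
EDT = timezone(timedelta(hours=-4))

# Hypothetical record mirroring the fields shown for job 290742; a real check
# would read these values from the accounting data instead.
jobs = [
    {
        "job_id": 290742,
        "state": "CANCELLED",
        "start": datetime(2022, 8, 8, 8, 46, 53, tzinfo=EDT),
        "end": datetime(2023, 8, 8, 8, 47, 1, tzinfo=EDT),
    },
]

# Illustrative subset of states in which a job should already have ended.
FINISHED_STATES = {"CANCELLED", "COMPLETED", "FAILED", "TIMEOUT"}


def format_elapsed(delta):
    """Render a timedelta as hours, minutes, and seconds, e.g. 8760h0m8s."""
    total = int(delta.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}h{minutes}m{seconds}s"


def ends_in_future(job, now):
    """True if the job is in a finished state but its end time is after `now`."""
    return job["state"] in FINISHED_STATES and job["end"] > now


# An arbitrary instant on the day of this thread, used as "now".
now = datetime(2022, 12, 20, 12, 0, tzinfo=EDT)

for job in jobs:
    elapsed = job["end"] - job["start"]
    print(job["job_id"], format_elapsed(elapsed), ends_in_future(job, now))
```

For the values above, the elapsed time works out to 365 days plus 8 seconds, which matches the 8760h0m8s run time reported by the TUI, and the job is flagged because its end time falls after the chosen `now`.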
