Hoping this is a fairly simple one.

This is a small internal cluster that we’ve been using for about six months
now, and we’ve had some infrastructure instability in that time, which I
suspect is the root cause of this weirdness, but hopefully someone can point
me in the right direction to solve the issue.

I send a daily email of sreport output to show how busy the cluster was and
who the top users were.
Weirdly, I have one user who shows the exact same usage day after day after
day, down to a hundredth of a percent, conspicuously even when they were on
vacation and claimed they had no job submissions from cron or the like.
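
(For context, the daily report is generated with something along these lines;
the dates and TopCount here are just illustrative:

  sreport cluster utilization start=2022-12-13 end=2022-12-14 -t percent
  sreport user top start=2022-12-13 end=2022-12-14 TopCount=10 -t percent
)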

So, taking the scom TUI
<https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html>
posted this morning for a spin, I filtered on that user and noticed that even
though I was only looking two days back in the job history, I was seeing a
job from August.

Conspicuously, the job state is CANCELLED, but the job end time is one year
after the start time, which puts the end in 2023.
So something in the dbd is confused about these lingering jobs: they report
as cancelled but somehow remain “on the books” until next August.

> ╭──────────────────────────────────────────────────────────╮
> │                                                          │
> │  Job ID               : 290742                           │
> │  Job Name             : $jobname                         │
> │  User                 : $user                            │
> │  Group                : $user                            │
> │  Job Account          : $account                         │
> │  Job Submission       : 2022-08-08 08:44:52 -0400 EDT    │
> │  Job Start            : 2022-08-08 08:46:53 -0400 EDT    │
> │  Job End              : 2023-08-08 08:47:01 -0400 EDT    │
> │  Job Wait time        : 2m1s                             │
> │  Job Run time         : 8760h0m8s                        │
> │  Partition            : $part                            │
> │  Priority             : 127282                           │
> │  QoS                  : $qos                             │
> │                                                          │
> ╰──────────────────────────────────────────────────────────╯
> Steps count: 0

> Filter: $user         Items: 13
> 
>  Job ID   Job Name   Part.  QoS   Account  User    Nodes   State
> ────────────────────────────────────────────────────────────────────
>  290714   $jobname   $part  $qos  $acct    $user   node32  CANCELLED
>  290716   $jobname   $part  $qos  $acct    $user   node24  CANCELLED
>  290736   $jobname   $part  $qos  $acct    $user   node00  CANCELLED
>  290742   $jobname   $part  $qos  $acct    $user   node01  CANCELLED
>  290770   $jobname   $part  $qos  $acct    $user   node02  CANCELLED
>  290777   $jobname   $part  $qos  $acct    $user   node03  CANCELLED
>  290793   $jobname   $part  $qos  $acct    $user   node04  CANCELLED
>  290797   $jobname   $part  $qos  $acct    $user   node05  CANCELLED
>  290799   $jobname   $part  $qos  $acct    $user   node06  CANCELLED
>  290801   $jobname   $part  $qos  $acct    $user   node07  CANCELLED
>  290814   $jobname   $part  $qos  $acct    $user   node08  CANCELLED
>  290817   $jobname   $part  $qos  $acct    $user   node09  CANCELLED
>  290819   $jobname   $part  $qos  $acct    $user   node10  CANCELLED
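
For anyone who wants to sanity-check against the raw accounting data rather
than scom’s rendering, the same records should be retrievable straight from
the dbd with something like:

  sacct -j 290742 -o JobID,Submit,Start,End,Elapsed,State

and I’d expect that to show the same 2023 end time.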


I’d love to figure out the proper way to either cleanly purge these job IDs
from the accounting database, or change the job end/run times to sane,
correct values.
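
If there’s no sanctioned tool for this, I assume the fallback is direct
surgery on the slurmdbd MySQL database. Below is a minimal sketch of what I
imagine that looks like, assuming a cluster named “mycluster” (so job records
live in mycluster_job_table), with slurmdbd stopped and the database backed
up first. The 8760h0m8s run time is exactly 365 days plus 8 seconds, so my
guess is the real end was start + 8s and a year got tacked on somewhere:

  # time_end is epoch seconds; 290742 shown, same idea for the other 12 IDs
  mysql slurm_acct_db -e \
    "UPDATE mycluster_job_table SET time_end = time_start + 8 WHERE id_job = 290742;"

But I’d much rather hear the proper way before poking at the DB by hand.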
Slurm is v21.08.8-2, and our NTP source is a stratum 1 server, so time is in
sync everywhere; not that multiple servers would all drift exactly one year
like this anyway.

Thanks for any help,
Reed
