Two votes for runawayjobs is a strong signal (and something I’m glad to learn exists for the future); however, that does not appear to be the case here.

> # sacctmgr show runawayjobs
> Runaway Jobs: No runaway jobs found on cluster $cluster

So unfortunately that doesn’t appear to be the culprit.
Appreciate the responses.

Reed

> On Dec 20, 2022, at 10:03 AM, Brian Andrus <toomuc...@gmail.com> wrote:
>
> Try:
>
> sacctmgr list runawayjobs
>
> Brian Andrus
>
> On 12/20/2022 7:54 AM, Reed Dier wrote:
>> Hoping this is a fairly simple one.
>>
>> This is a small internal cluster that we’ve been using for about 6 months
>> now, and we’ve had some infrastructure instability in that time, which I
>> think may be the root culprit behind this weirdness, but hopefully someone
>> can point me in the direction to solve the issue.
>>
>> I do a daily email of sreport to show how busy the cluster was, and who were
>> the top users.
>> Weirdly, I have a user that seems to be able to use the same exact usage day
>> after day after day, down to hundredth of a percent, conspicuously even when
>> they were on vacation and claimed that they didn’t have job submissions in
>> cron/etc.
>>
>> So then, taking a spin of the scom tui
>> <https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html>
>> posted this morning, I then filtered that user, and noticed that even though
>> I was only looking 2 days back at job history, I was seeing a job from August.
>>
>> Conspicuously, the job state is cancelled, but the job end time is 1y from
>> the start time, meaning its job end time is in 2023.
>> So something with the dbd is confused about this/these jobs that are
>> lingering and reporting cancelled but still “on the books” somehow until
>> next August.
>>
>>> ╭──────────────────────────────────────────────────╮
>>> │                                                  │
>>> │ Job ID         : 290742                          │
>>> │ Job Name       : $jobname                        │
>>> │ User           : $user                           │
>>> │ Group          : $user                           │
>>> │ Job Account    : $account                        │
>>> │ Job Submission : 2022-08-08 08:44:52 -0400 EDT   │
>>> │ Job Start      : 2022-08-08 08:46:53 -0400 EDT   │
>>> │ Job End        : 2023-08-08 08:47:01 -0400 EDT   │
>>> │ Job Wait time  : 2m1s                            │
>>> │ Job Run time   : 8760h0m8s                       │
>>> │ Partition      : $part                           │
>>> │ Priority       : 127282                          │
>>> │ QoS            : $qos                            │
>>> │                                                  │
>>> │                                                  │
>>> ╰──────────────────────────────────────────────────╯
>>> Steps count: 0
>>
>>> Filter: $user Items: 13
>>>
>>> Job ID  Job Name  Part.  QoS   Account  User   Nodes   State
>>> ────────────────────────────────────────────────────────────────
>>> 290714  $jobname  $part  $qos  $acct    $user  node32  CANCELLED
>>> 290716  $jobname  $part  $qos  $acct    $user  node24  CANCELLED
>>> 290736  $jobname  $part  $qos  $acct    $user  node00  CANCELLED
>>> 290742  $jobname  $part  $qos  $acct    $user  node01  CANCELLED
>>> 290770  $jobname  $part  $qos  $acct    $user  node02  CANCELLED
>>> 290777  $jobname  $part  $qos  $acct    $user  node03  CANCELLED
>>> 290793  $jobname  $part  $qos  $acct    $user  node04  CANCELLED
>>> 290797  $jobname  $part  $qos  $acct    $user  node05  CANCELLED
>>> 290799  $jobname  $part  $qos  $acct    $user  node06  CANCELLED
>>> 290801  $jobname  $part  $qos  $acct    $user  node07  CANCELLED
>>> 290814  $jobname  $part  $qos  $acct    $user  node08  CANCELLED
>>> 290817  $jobname  $part  $qos  $acct    $user  node09  CANCELLED
>>> 290819  $jobname  $part  $qos  $acct    $user  node10  CANCELLED
>>
>>
>> I’d love to figure out the proper way to either purge these jid’s from the
>> accounting database cleanly, or change the job end/run time to a
>> sane/correct value.
>> Slurm is v21.08.8-2, and ntp is a stratum 1 server, so time is in sync
>> everywhere, not that multiple servers would drift 1 year off like this.
>>
>> Thanks for any help,
>> Reed
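
For anyone who wants to hunt for records like job 290742 above without a TUI, here is a minimal sketch in Python. It is not something posted in this thread. It assumes pipe-delimited accounting lines in the order JobID|State|Start|End with ISO timestamps, for example as produced by something like `sacct -X -n -P -o JobID,State,Start,End`. The function name `find_suspect_jobs` and the 30-day threshold are hypothetical choices for illustration only, and the sample records are made up apart from the start/end times of job 290742 quoted above.

```python
from datetime import datetime, timedelta


def find_suspect_jobs(lines, max_runtime=timedelta(days=30)):
    """Return (job_id, state, runtime) for jobs whose End - Start exceeds max_runtime.

    Each input line is assumed to look like: JobID|State|Start|End
    with ISO timestamps such as 2022-08-08T08:46:53.
    """
    suspects = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        job_id, state, start, end = line.split("|")[:4]
        try:
            started = datetime.fromisoformat(start)
            ended = datetime.fromisoformat(end)
        except ValueError:
            # Non-timestamp values (e.g. "Unknown" for a job still running)
            # cannot be compared, so skip them.
            continue
        runtime = ended - started
        if runtime > max_runtime:
            suspects.append((job_id, state, runtime))
    return suspects


# Illustrative input; only job 290742's times come from the thread.
sample = [
    "290742|CANCELLED|2022-08-08T08:46:53|2023-08-08T08:47:01",
    "290743|COMPLETED|2022-08-08T09:00:00|2022-08-08T09:30:00",
    "290744|RUNNING|2022-12-20T07:00:00|Unknown",
]

for job_id, state, runtime in find_suspect_jobs(sample):
    print(job_id, state, runtime)
```

With the sample input, only 290742 is flagged: its start and end are 365 days and 8 seconds apart, which matches the 8760h0m8s run time shown in the job detail above. This only finds the odd records; it does not change anything in the accounting database.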