Just to follow up with some things I’ve tried: scancel doesn’t want to touch it:

> # scancel -v 290710
> scancel: Terminating job 290710
> scancel: error: Kill job error on job id 290710: Job/step already completing or completed
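As a sanity check on where the mismatch lives: squeue no longer knows anything about these job IDs, while sacct still reports the 2023 end times, so this looks like it is purely a slurmdbd record problem rather than anything slurmctld is still tracking. I’m retyping these from memory rather than pasting output, so treat them as the general shape of the check:

  squeue -j 290710
  sacct -j 290710 -X -o JobID,State,Start,End,Elapsed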
scontrol does see that these are all members of the same array, but doesn’t want to touch it:

> # scontrol update JobID=290710 EndTime=2022-08-09T08:47:01
> 290710_4,6,26,32,60,67,83,87,89,91,...: Job has already finished

And trying to modify the job’s end time with sacctmgr fails as expected, because EndTime is only a where spec, not a set spec (I also tried EndTime=now with the same result):

> # sacctmgr modify job where JobID=290710 set EndTime=2022-08-09T08:47:01
> Unknown option: EndTime=2022-08-09T08:47:01
> Use keyword 'where' to modify condition
> You didn't give me anything to set

I was able to set a comment for the jobs/array, so the DBD can see/talk to them.

One additional thing to mention: there are 14 JIDs stuck like this, 1 of which is the array JID itself, and 13 of which are array tasks on that original array ID.

But I figured I would provide some of the other steps I’ve tried, to flush out those ideas. The only remaining idea I have is editing the records directly in the accounting database, which I’ve sketched at the very bottom, below the quoted thread.

Thanks,
Reed

> On Dec 20, 2022, at 10:08 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>
> 2 votes for runawayjobs is a strong vote (and also something I’m glad to learn exists for the future), however, it does not appear to be the case.
>
>> # sacctmgr show runawayjobs
>> Runaway Jobs: No runaway jobs found on cluster $cluster
>
> So unfortunately that doesn’t appear to be the culprit.
>
> Appreciate the responses.
>
> Reed
>
>> On Dec 20, 2022, at 10:03 AM, Brian Andrus <toomuc...@gmail.com> wrote:
>>
>> Try:
>>
>> sacctmgr list runawayjobs
>>
>> Brian Andrus
>>
>> On 12/20/2022 7:54 AM, Reed Dier wrote:
>>> Hoping this is a fairly simple one.
>>>
>>> This is a small internal cluster that we’ve been using for about 6 months now, and we’ve had some infrastructure instability in that time, which I think may be the root culprit behind this weirdness, but hopefully someone can point me in the direction to solve the issue.
>>>
>>> I do a daily email of sreport to show how busy the cluster was, and who were the top users.
>>> Weirdly, I have a user that seems to show the exact same usage day after day after day, down to the hundredth of a percent, conspicuously even when they were on vacation and claimed that they didn’t have job submissions in cron/etc.
>>>
>>> So then, taking a spin of the scom tui posted this morning (https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html), I filtered on that user, and noticed that even though I was only looking 2 days back at job history, I was seeing a job from August.
>>>
>>> Conspicuously, the job state is cancelled, but the job end time is 1y from the start time, meaning its job end time is in 2023.
>>> So something with the dbd is confused about this/these jobs that are lingering and reporting cancelled but still “on the books” somehow until next August.
>>>
>>>> Job ID          : 290742
>>>> Job Name        : $jobname
>>>> User            : $user
>>>> Group           : $user
>>>> Job Account     : $account
>>>> Job Submission  : 2022-08-08 08:44:52 -0400 EDT
>>>> Job Start       : 2022-08-08 08:46:53 -0400 EDT
>>>> Job End         : 2023-08-08 08:47:01 -0400 EDT
>>>> Job Wait time   : 2m1s
>>>> Job Run time    : 8760h0m8s
>>>> Partition       : $part
>>>> Priority        : 127282
>>>> QoS             : $qos
>>>> Steps count: 0
>>>
>>>> Filter: $user    Items: 13
>>>>
>>>> Job ID    Job Name    Part.   QoS     Account   User     Nodes    State
>>>> ────────────────────────────────────────────────────────────────────────
>>>> 290714    $jobname    $part   $qos    $acct     $user    node32   CANCELLED
>>>> 290716    $jobname    $part   $qos    $acct     $user    node24   CANCELLED
>>>> 290736    $jobname    $part   $qos    $acct     $user    node00   CANCELLED
>>>> 290742    $jobname    $part   $qos    $acct     $user    node01   CANCELLED
>>>> 290770    $jobname    $part   $qos    $acct     $user    node02   CANCELLED
>>>> 290777    $jobname    $part   $qos    $acct     $user    node03   CANCELLED
>>>> 290793    $jobname    $part   $qos    $acct     $user    node04   CANCELLED
>>>> 290797    $jobname    $part   $qos    $acct     $user    node05   CANCELLED
>>>> 290799    $jobname    $part   $qos    $acct     $user    node06   CANCELLED
>>>> 290801    $jobname    $part   $qos    $acct     $user    node07   CANCELLED
>>>> 290814    $jobname    $part   $qos    $acct     $user    node08   CANCELLED
>>>> 290817    $jobname    $part   $qos    $acct     $user    node09   CANCELLED
>>>> 290819    $jobname    $part   $qos    $acct     $user    node10   CANCELLED
>>>
>>> I’d love to figure out the proper way to either purge these jid’s from the accounting database cleanly, or change the job end/run time to a sane/correct value.
>>> Slurm is v21.08.8-2, and ntp is a stratum 1 server, so time is in sync everywhere, not that multiple servers would drift 1 year off like this.
>>>
>>> Thanks for any help,
>>> Reed
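P.S. For anyone who finds this thread later: the only remaining idea I have is to fix the records directly in the accounting database. I have not actually done this, so treat it purely as a sketch; it assumes the stock MySQL/MariaDB slurmdbd backend, the default slurm_acct_db database name, and the usual ${cluster}_job_table schema where time_start/time_end are stored as unix timestamps (and obviously take a dump of the database, and probably stop slurmdbd, before touching anything). First look at what is actually stored for the array:

  mysql slurm_acct_db -e "SELECT id_job, id_array_job, state, FROM_UNIXTIME(time_start) AS start_time, FROM_UNIXTIME(time_end) AS end_time FROM ${cluster}_job_table WHERE id_array_job = 290710;"

and then, if the only wrong thing really is the year-in-the-future end time, clamp it back to something sane, e.g. the recorded start time:

  mysql slurm_acct_db -e "UPDATE ${cluster}_job_table SET time_end = time_start WHERE id_array_job = 290710 AND time_end > UNIX_TIMESTAMP();"

I’m also not sure whether the daily sreport numbers would correct themselves after that, since I believe sreport reads from slurmdbd’s usage rollup tables rather than the raw job records.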