Re: [slurm-users] Job cancelled into the future

2023-01-19 Thread Reed Dier
Just to hopefully close this out, I believe I was actually able to resolve this in “user-land” rather than mucking with the database. I was able to requeue the bad job IDs, and they went pending. Then I updated the jobs to a time limit of 60. Then I scancelled the jobs, and they returned to a cance
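The requeue, update, and cancel sequence described above could be sketched roughly as below. This is an illustration, not the exact commands from the message: the helper name `plan_requeue_fix` and the dry-run approach are assumptions, the time limit of 60 is assumed to be in minutes (how scontrol reads a bare number), and job ID 290710 is borrowed from an earlier message in this thread. The helper only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the requeue / update / cancel
# sequence for each job ID passed as an argument.
plan_requeue_fix() {
    for jid in "$@"; do
        # 1. Requeue the stuck job so it goes back to pending.
        echo "scontrol requeue $jid"
        # 2. Set a short time limit (assumed to be minutes).
        echo "scontrol update JobId=$jid TimeLimit=60"
        # 3. Cancel the job again.
        echo "scancel $jid"
    done
}

# Example: show the plan for one of the job IDs mentioned in the thread.
plan_requeue_fix 290710
```

Once the printed plan looks right, piping it to `sh` would run the commands.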

Re: [slurm-users] Job cancelled into the future

2023-01-17 Thread Reed Dier
So I was going to take a stab at trying to rectify this after taking care of post-holiday matters. Paste of the $CLUSTER_job_table table where I think I see the issue, and now I just want to sanity check my steps to remediate. https://rentry.co/qhw6mg (pastebin alterna

Re: [slurm-users] Job cancelled into the future

2022-12-23 Thread Chris Samuel
On 20/12/22 6:01 pm, Brian Andrus wrote:
> You may want to dump the database, find what table/records need updated and try updating them. If anything went south, you could restore from the dump.
+lots to making sure you've got good backups first, and stop slurmdbd before you start on the backu
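A backup-first plan along the lines suggested above might look like the sketch below. Everything specific in it is an assumption rather than something stated in the thread: a systemd-managed slurmdbd, a MySQL/MariaDB backend, the database name `slurm_acct_db`, and the dump file name. The helper `plan_db_backup` is hypothetical and only prints the steps.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a backup-first plan for the
# accounting database. Database and dump file names are assumptions.
plan_db_backup() {
    db="${1:-slurm_acct_db}"        # assumed database name
    dump="${2:-slurm_acct_db.sql}"  # assumed dump file name
    # Stop slurmdbd first so nothing writes during the dump.
    echo "systemctl stop slurmdbd"
    # Take the dump that any later restore would come from.
    echo "mysqldump --single-transaction $db > $dump"
    # Record edits would happen here; this line shows the rollback path.
    echo "# to roll back: mysql $db < $dump"
    echo "systemctl start slurmdbd"
}

plan_db_backup
```

Printing the plan rather than executing it leaves room to adjust names and credentials for the local setup before anything touches the database.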

Re: [slurm-users] Job cancelled into the future

2022-12-20 Thread Brian Andrus
Seems like the time may have been off on the db server at the insert/update. You may want to dump the database, find what table/records need updated and try updating them. If anything went south, you could restore from the dump.
Brian Andrus
On 12/20/2022 11:51 AM, Reed Dier wrote:
> Just to

Re: [slurm-users] Job cancelled into the future

2022-12-20 Thread Reed Dier
Just to follow up with some things I’ve tried:
scancel doesn’t want to touch it:
> # scancel -v 290710
> scancel: Terminating job 290710
> scancel: error: Kill job error on job id 290710: Job/step already completing or completed
scontrol does see that these are all members of the same array, b

Re: [slurm-users] Job cancelled into the future

2022-12-20 Thread Reed Dier
2 votes for runawayjobs is a strong vote (and also something I’m glad to learn exists for the future), however, it does not appear to be the case.
> # sacctmgr show runawayjobs
> Runaway Jobs: No runaway jobs found on cluster $cluster
So unfortunately that doesn’t appear to be the culprit. Appr

Re: [slurm-users] Job cancelled into the future

2022-12-20 Thread Brian Andrus
Try:
    sacctmgr list runawayjobs
Brian Andrus
On 12/20/2022 7:54 AM, Reed Dier wrote:
> Hoping this is a fairly simple one. This is a small internal cluster that we’ve been using for about 6 months now, and we’ve had some infrastructure instability in that time, which I think may be the roo

Re: [slurm-users] Job cancelled into the future

2022-12-20 Thread Sarlo, Jeffrey S
Do they show up as runaway jobs?
    sacctmgr show runawayjobs
If they do, it should give you the option to fix them.
Jeff
From: slurm-users On Behalf Of Reed Dier
Sent: Tuesday, December 20, 2022 9:54 AM
To: Slurm User Community List
Subject: [slurm-users] Job cancelled into the future
> Hoping

[slurm-users] Job cancelled into the future

2022-12-20 Thread Reed Dier
Hoping this is a fairly simple one. This is a small internal cluster that we’ve been using for about 6 months now, and we’ve had some infrastructure instability in that time, which I think may be the root culprit behind this weirdness, but hopefully someone can point me in the direction to solv