Just to hopefully close this out, I believe I was actually able to resolve this in “user-land” rather than mucking with the database.
I was able to requeue the bad job IDs, and they went pending. Then I updated the jobs to a time limit of 60 and scancelled them; they returned to a cancelled state and rolled off within about 10 minutes.

I’m surprised I didn’t think to try requeueing earlier, but here’s hoping this did the trick and that I will have more accurate reporting and fewer “more time than is possible” log errors.

Thanks,
Reed

> On Jan 17, 2023, at 11:29 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>
> So I was going to take a stab at trying to rectify this after taking care of
> post-holiday matters.
>
> Paste of the $CLUSTER_job_table table where I think I see the issue, and now
> I just want to sanity check my steps to remediate.
> https://rentry.co/qhw6mg (pastebin alternative
> because markdown is paywalled for pastebin).
>
> There are a number of job steps with a timelimit of 4294967295, whereas the
> others of the same job array are 525600.
> Obviously I want to edit those time limits to sane limits (match them to the
> others).
> I don’t see anything in the $CLUSTER_step_table that looks like it would need
> to be modified to match, though I could be wrong.
>
> But then the part of getting slurm to pick it up is where I’m wanting to make
> sure I’m on the right page.
> Should I manually update the mod_time timestamp and slurm will catch that at
> its next rollup?
> Or will slurm catch the change in the time limit and update the mod_time when
> it sees it upon rollup?
>
> I also don’t see any documentation stating how to manually trigger a rollup,
> either via slurmdbd.conf or command line flag.
> Will it automagically perform a rollup at some predefined, non-configurable
> interval, or when restarting the daemon?
>
> Apologies if this is all trivial information, just trying to measure twice
> and cut once.
>
> Appreciate everyone’s help so far.
>
> Thanks,
> Reed
>
>> On Dec 23, 2022, at 7:18 PM, Chris Samuel <ch...@csamuel.org> wrote:
>>
>> On 20/12/22 6:01 pm, Brian Andrus wrote:
>>
>>> You may want to dump the database, find what table/records need updated and
>>> try updating them. If anything went south, you could restore from the dump.
>>
>> +lots to making sure you've got good backups first, and stop slurmdbd before
>> you start on the backups and don't restart it until you've made the changes,
>> including setting the rollup times to be before the jobs started to make
>> sure that the rollups include these changes!
>>
>> When you start slurmdbd after making the changes it should see that it needs
>> to do rollups and kick those off.
>>
>> All the best,
>> Chris
>> --
>> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
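
For anyone finding this thread later, the user-land cleanup described at the top of this message (requeue, lower the time limit, cancel) could be scripted roughly as follows. This is a minimal sketch, not the exact commands that were run: the thread does not show them, so the use of `scontrol requeue`, `scontrol update JobId=... TimeLimit=...` and `scancel` is an assumption about how those steps were performed. The job IDs are hypothetical placeholders and the helper names are made up for illustration. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
from typing import List, Sequence


def build_cleanup_commands(job_ids: Sequence[str], time_limit: str = "60") -> List[List[str]]:
    """Build the requeue / update / cancel commands, one phase at a time.

    Assumption: each step from the message maps onto the standard Slurm
    client commands below. The default time limit of 60 is the value
    mentioned in the message.
    """
    commands: List[List[str]] = []
    # Phase 1: requeue every affected job so it goes back to pending.
    for jid in job_ids:
        commands.append(["scontrol", "requeue", jid])
    # Phase 2: replace the bogus time limit with a sane one.
    for jid in job_ids:
        commands.append(["scontrol", "update", f"JobId={jid}", f"TimeLimit={time_limit}"])
    # Phase 3: cancel the jobs so they end up in a cancelled state.
    for jid in job_ids:
        commands.append(["scancel", jid])
    return commands


def run_commands(commands: Sequence[Sequence[str]], dry_run: bool = True) -> None:
    """Print each command; only execute it when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(list(cmd), check=True)


if __name__ == "__main__":
    # Hypothetical job IDs; substitute the ones showing the bad time limit.
    example_job_ids = ["1001_3", "1001_7"]
    run_commands(build_cleanup_commands(example_job_ids), dry_run=True)
```

Reviewing the dry-run output before switching `dry_run` to `False` keeps the "measure twice, cut once" spirit of the earlier messages, and none of it touches the accounting database directly.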