Re: [slurm-users] Job cancelled into the future

Reed Dier Thu, 19 Jan 2023 09:35:37 -0800

Just to hopefully close this out, I believe I was actually able to resolve this 
in “user-land” rather than mucking with the database.


I was able to requeue the bad job IDs, and they went pending.
Then I updated the jobs to a time limit of 60.
Then I scancelled the jobs, and they returned to a cancelled state, before they 
rolled off within about 10 minutes.
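
For anyone wanting to script the same sequence, here is a minimal sketch of the three steps above. It only builds and prints the command lines, so nothing is run against the cluster. The function name and the job ID are hypothetical, and it assumes the time limit of 60 is in minutes, which is how scontrol reads a bare number.

```python
import shlex


def remediation_commands(job_ids, time_limit=60):
    """Build the requeue -> update time limit -> cancel sequence.

    Returns a list of argv lists, grouped by phase, so every job is
    requeued before any time limit is changed or any job is cancelled.
    """
    ids = [str(jid) for jid in job_ids]
    commands = []
    # Phase 1: requeue each stuck job so it goes back to pending.
    commands += [["scontrol", "requeue", jid] for jid in ids]
    # Phase 2: give each job a sane time limit (bare number = minutes).
    commands += [
        ["scontrol", "update", f"JobId={jid}", f"TimeLimit={time_limit}"]
        for jid in ids
    ]
    # Phase 3: cancel the jobs so they end up in a cancelled state.
    commands += [["scancel", jid] for jid in ids]
    return commands


if __name__ == "__main__":
    # "1234_5" is a made-up array task ID, used for illustration only.
    for argv in remediation_commands(["1234_5"]):
        print(shlex.join(argv))
```

Printing first and pasting by hand (or passing each list to subprocess.run once the output looks right) keeps a typo in a job ID from cancelling the wrong job.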

Surprised I didn’t think to try requeueing earlier, but here’s to hoping that 
this did the trick, and I will have more accurate reporting and fewer “more 
time than is possible” log errors.
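
As a quick sanity check on the two timelimit values from the job table quoted below, here is a small arithmetic sketch. It assumes the column is stored in minutes, which the thread does not state outright. The variable names are my own.

```python
MINUTES_PER_DAY = 24 * 60

sane_limit = 525600      # timelimit on the healthy array tasks
odd_limit = 4294967295   # timelimit on the problem entries

# 525600 minutes is exactly 365 days.
print("sane limit in days:", sane_limit / MINUTES_PER_DAY)

# 4294967295 is the largest value an unsigned 32-bit integer can hold,
# which suggests an "unlimited" or unset marker rather than a real limit
# (that interpretation is an assumption, not something the thread confirms).
print("odd limit is 2**32 - 1:", odd_limit == 2**32 - 1)

# Read literally as minutes, it would be thousands of years.
print("odd limit in years:", odd_limit / sane_limit)
```

Either way, the odd value is clearly not a limit anyone set on purpose, which fits with matching those entries to the 525600 used by the rest of the array.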

Thanks,
Reed

> On Jan 17, 2023, at 11:29 AM, Reed Dier <reed.d...@focusvq.com> wrote:
> 
> So I was going to take a stab at trying to rectify this after taking care of 
> post-holiday matters.
> 
> Paste of the $CLUSTER_job_table table where I think I see the issue, and now 
> I just want to sanity check my steps to remediate.
> https://rentry.co/qhw6mg (pastebin alternative 
> because markdown is paywalled for pastebin).
> 
> There are a number of job steps with a timelimit of 4294967295, whereas the 
> others of the same job array are 525600.
> Obviously I want to edit those time limits to sane limits (match them to the 
> others).
> I don’t see anything in the $CLUSTER_step_table that looks like it would need 
> to be modified to match, though I could be wrong.
> 
> But then the part of getting slurm to pick it up is where I’m wanting to make 
> sure I’m on the right page.
> Should I manually update the mod_time timestamp and slurm will catch that at 
> its next rollup?
> Or will slurm catch the change in the time limit and update the mod_time when 
> it sees it upon rollup?
> 
> I also don’t see any documentation stating how to manually trigger a rollup, 
> either via slurmdbd.conf or command line flag.
> Will it automagically perform a rollup at some predefined, non-configurable 
> interval, or when restarting the daemon?
> 
> Apologies if this is all trivial information, just trying to measure twice 
> and cut once.
> 
> Appreciate everyone’s help so far.
> 
> Thanks,
> Reed
> 
>> On Dec 23, 2022, at 7:18 PM, Chris Samuel <ch...@csamuel.org> wrote:
>> 
>> On 20/12/22 6:01 pm, Brian Andrus wrote:
>> 
>>> You may want to dump the database, find what table/records need updating and 
>>> try updating them. If anything went south, you could restore from the dump.
>> 
>> +lots to making sure you've got good backups first, and stop slurmdbd before 
>> you start on the backups and don't restart it until you've made the changes, 
>> including setting the rollup times to be before the jobs started to make 
>> sure that the rollups include these changes!
>> 
>> When you start slurmdbd after making the changes it should see that it needs 
>> to do rollups and kick those off.
>> 
>> All the best,
>> Chris
>> -- 
>> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
