Re: [slurm-users] Terminating Jobs based on GrpTRESMins

Hoot Thompson Thu, 27 Apr 2023 13:41:59 -0700

I have GrpTRESMins working and terminating jobs as expected. I was working 
under the belief that the limit's “current value” was only updated upon job 
completion. That is not the case; it appears to be updated every 5 minutes. 
If and when the limit/threshold is crossed, jobs are in fact canceled.
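
To illustrate the behavior described above, here is a minimal Python sketch of 
a periodic usage check. It is only a model of what I observed, not Slurm's 
actual implementation: the function name, its parameters, and the fixed 
5-minute interval are all assumptions made for illustration.

```python
# Hypothetical model of periodic GrpTRESMins accounting.
# Assumption: usage is accumulated once per update interval (5 minutes here,
# based on observation only) and the job is cancelled at the first update
# where accumulated TRES-minutes reach the limit.

def minute_job_is_cancelled(limit_tres_minutes, cpus, runtime_minutes,
                            update_interval=5):
    """Return the elapsed minute at which the job would be cancelled,
    or None if it finishes without reaching the limit."""
    used = 0.0
    elapsed = 0
    while elapsed < runtime_minutes:
        # Advance to the next usage update (or to the end of the job).
        step = min(update_interval, runtime_minutes - elapsed)
        elapsed += step
        used += cpus * step  # cpu-minutes consumed during this step
        if used >= limit_tres_minutes:
            return elapsed
    return None


if __name__ == "__main__":
    # A 4-CPU job against a 90 cpu-minute limit crosses the limit at
    # 22.5 minutes, but the model only notices at the next update.
    print(minute_job_is_cancelled(90, 4, 60))
    # A generous limit is never reached, so the job runs to completion.
    print(minute_job_is_cancelled(1000, 4, 60))
```

In this model a job can overrun the limit by up to one update interval before 
it is cancelled, which matches what I saw in testing.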


Thanks for your help.


> On Apr 24, 2023, at 1:55 PM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> 
> wrote:
> 
> On 24-04-2023 18:33, Hoot Thompson wrote:
>> In my reading of the Slurm documentation, it seems that exceeding the limits 
>> set in GrpTRESMins should result in terminating a running job. However, in 
>> testing this, the ‘current value’ of the GrpTRESMins only updates upon job 
>> completion and is not updated as the job progresses. Therefore jobs aren’t 
>> being stopped. On the positive side, no new jobs are started if the limit is 
>> exceeded. Here’s the documentation that is confusing me:
> 
> I think the job's resource usage will only be added to the Slurm database upon 
> job completion.  I believe that Slurm doesn't update the resource usage 
> continually as you seem to expect.
> 
>> If any limit is reached, all running jobs with that TRES in this group will 
>> be killed, and no new jobs will be allowed to run.
>> Perhaps there is a setting or misconfiguration on my part.
> 
> The sacctmgr manual page states:
> 
>> GrpTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
>> The total number of TRES minutes that can possibly be used by past, present 
>> and future jobs running from this association and its children.  To clear a 
>> previously set value use the modify command with a new value of -1 for each 
>> TRES id.
>> NOTE: This limit is not enforced if set on the root association of a 
>> cluster.  So even though it may appear in sacctmgr output, it will not be 
>> enforced.
>> ALSO NOTE: This limit only applies when using the Priority Multifactor 
>> plugin.  The time is decayed using the value of PriorityDecayHalfLife or 
>> PriorityUsageResetPeriod as set in the slurm.conf.  When this limit is 
>> reached all associated jobs running will be killed and all future jobs 
>> submitted with associations in the group will be delayed until they are able 
>> to run inside the limit.
> 
> Can you please confirm that you have configured the "Priority Multifactor" 
> plugin?
> 
> Your jobs should not be able to start if the user's GrpTRESMins has been 
> exceeded.  Hence they won't be killed!
> 
> Can you explain step by step what you observe?  It may be that the above 
> documentation of killing jobs is in error, in which case we should make a bug 
> report to SchedMD.
> 
> /Ole
> 
> 
> 
>
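
For reference, the decay mentioned in the quoted sacctmgr text can be sketched 
as a simple half-life calculation. This is a hedged illustration only: the 
function and parameter names are hypothetical, and it assumes plain 
exponential half-life decay, which may differ in detail from how Slurm 
applies PriorityDecayHalfLife internally.

```python
# Hypothetical sketch of half-life decay applied to accumulated usage.
# Assumption: usage halves once per half-life period, with no other
# adjustments.

def decayed_usage(usage_minutes, elapsed_days, half_life_days):
    """Return the usage remaining after elapsed_days of decay."""
    return usage_minutes * 0.5 ** (elapsed_days / half_life_days)


if __name__ == "__main__":
    # With a 7-day half-life, 1000 TRES-minutes of usage counts as
    # half as much after one week and a quarter as much after two.
    print(decayed_usage(1000, 7, 7))
    print(decayed_usage(1000, 14, 7))
```

Under this assumption, older usage gradually stops counting against the 
GrpTRESMins limit, so an association that has hit the limit can run again 
once enough of its usage has decayed.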
