I'm somewhat confused by your statement "This would only occur if you lower the 
GrpTRESMins limit after a job has started." My test case had the limits 
established before job submission, and the job was terminated when the 
threshold was crossed.

Hoot

> On Apr 28, 2023, at 6:43 AM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> 
> wrote:
> 
> Hi Hoot,
> 
> I'm glad that you have figured out that GrpTRESMins is working as documented 
> and kills running jobs when the limit is exceeded.  This would only occur if 
> you lower the GrpTRESMins limit after a job has started.
> 
> /Ole
> 
> On 4/27/23 22:39, Hoot Thompson wrote:
>> I have GrpTRESMins working and terminating jobs as expected. I was working 
>> under the belief that the limit's "current value" was only updated upon job 
>> completion. That is not the case; it appears to be updated every 5 minutes. 
>> If and when the limit/threshold is crossed, jobs are in fact canceled.
>> Thanks for your help.
>>> On Apr 24, 2023, at 1:55 PM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> 
>>> wrote:
>>> 
>>> On 24-04-2023 18:33, Hoot Thompson wrote:
>>>> In my reading of the Slurm documentation, it seems that exceeding the 
>>>> limits set in GrpTRESMins should result in terminating a running job. 
>>>> However, in testing this, the "current value" of the GrpTRESMins only 
>>>> updates upon job completion and is not updated as the job progresses. 
>>>> Therefore jobs aren't being stopped. On the positive side, no new jobs are 
>>>> started if the limit is exceeded. Here's the documentation that is 
>>>> confusing me:
>>> 
>>> I think the job's resource usage will only be added to the Slurm database 
>>> upon job completion.  I believe that Slurm doesn't update the resource 
>>> usage continually as you seem to expect.
>>> 
>>>> If any limit is reached, all running jobs with that TRES in this group 
>>>> will be killed, and no new jobs will be allowed to run.
>>>> Perhaps there is a setting or misconfiguration on my part.
>>> 
>>> The sacctmgr manual page states:
>>> 
>>>> GrpTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
>>>> The total number of TRES minutes that can possibly be used by past, 
>>>> present and future jobs running from this association and its children.  
>>>> To clear a previously set value use the modify command with a new value of 
>>>> -1 for each TRES id.
>>>> NOTE: This limit is not enforced if set on the root association of a 
>>>> cluster.  So even though it may appear in sacctmgr output, it will not be 
>>>> enforced.
>>>> ALSO NOTE: This limit only applies when using the Priority Multifactor 
>>>> plugin.  The time is decayed using the value of PriorityDecayHalfLife or 
>>>> PriorityUsageResetPeriod as set in the slurm.conf.  When this limit is 
>>>> reached all associated jobs running will be killed and all future jobs 
>>>> submitted with associations in the group will be delayed until they are 
>>>> able to run inside the limit.
>>> 
>>> Can you please confirm that you have configured the "Priority Multifactor" 
>>> plugin?
>>> 
>>> Your jobs should not be able to start if the user's GrpTRESMins has been 
>>> exceeded, hence they won't be killed!
>>> 
>>> Can you explain step by step what you observe?  It may be that the above 
>>> documentation of killing jobs is in error, in which case we should make a 
>>> bug report to SchedMD.
> 