I’m somewhat confused by your statement "This would only occur if you lower the
GrpTRESMins limit after a job has started." My test case had the limits
established before job submission, and the job was terminated when the threshold
was crossed.
Hoot
> On Apr 28, 2023, at 6:43 AM, Ole Holm Nielsen wrote:
>
> Hi Hoot,
>
> I'm glad that you have figured out that GrpTRESMins is working as documented
> and kills running jobs when the limit is exceeded. This would only occur if
> you lower the GrpTRESMins limit after a job has started.
>
> /Ole
>
> On 4/27/23 22:39, Hoot Thompson wrote:
>> I have GrpTRESMins working and terminating jobs as expected. I was working
>> under the belief that the limit's “current value” was only updated upon job
>> completion. That is not the case; it appears to be updated every 5 minutes.
>> If and when the limit/threshold is crossed, jobs are in fact canceled.
>> Thanks for your help.
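
The behaviour described above (usage checked at intervals rather than only at job completion) can be sketched in a few lines of Python. This is a minimal illustration, not Slurm's implementation: the function name and its parameters are hypothetical, the 5-minute period is simply the interval observed in this thread, decay is ignored, and usage is assumed to accrue at a constant rate.

```python
import math


def first_update_at_or_over_limit(tres_per_minute, limit_minutes, update_period=5.0):
    """Elapsed minutes at the first periodic update where usage has reached the limit.

    Hypothetical helper: assumes a constant accrual rate, no decay, and
    updates at exact multiples of update_period (an assumption, not a
    documented Slurm guarantee).
    """
    if tres_per_minute <= 0 or update_period <= 0:
        raise ValueError("rate and update period must be positive")
    # Moment at which accrued TRES-minutes equal the limit.
    crossing = limit_minutes / tres_per_minute
    # Usage is only inspected at update ticks, so round up to the next tick.
    ticks = max(math.ceil(crossing / update_period), 1)
    return ticks * update_period


# A job holding 4 CPUs against a 90 cpu-minute limit reaches the limit
# at 22.5 minutes, but it is only noticed at the next update.
print(first_update_at_or_over_limit(4, 90))  # 25.0
```

Under these assumptions a running job can overshoot its limit by up to one update period before it is canceled.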
>>> On Apr 24, 2023, at 1:55 PM, Ole Holm Nielsen wrote:
>>>
>>> On 24-04-2023 18:33, Hoot Thompson wrote:
>>>> In my reading of the Slurm documentation, it seems that exceeding the
>>>> limits set in GrpTRESMins should result in terminating a running job.
>>>> However, in testing this, the ‘current value’ of GrpTRESMins only
>>>> updates upon job completion and is not updated as the job progresses.
>>>> Therefore jobs aren’t being stopped. On the positive side, no new jobs are
>>>> started if the limit is exceeded. Here’s the documentation that is
>>>> confusing me...
>>>
>>> I think the job's resource usage will only be added to the Slurm database
>>> upon job completion. I believe that Slurm doesn't update the resource
>>> usage continually as you seem to expect.
>>>
>>>> If any limit is reached, all running jobs with that TRES in this group
>>>> will be killed, and no new jobs will be allowed to run.
>>>>
>>>> Perhaps there is a setting or misconfiguration on my part.
>>>
>>> The sacctmgr manual page states:
>>>
>>> GrpTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
>>> The total number of TRES minutes that can possibly be used by past,
>>> present and future jobs running from this association and its children.
>>> To clear a previously set value use the modify command with a new value of
>>> -1 for each TRES id.
>>> NOTE: This limit is not enforced if set on the root association of a
>>> cluster. So even though it may appear in sacctmgr output, it will not be
>>> enforced.
>>> ALSO NOTE: This limit only applies when using the Priority Multifactor
>>> plugin. The time is decayed using the value of PriorityDecayHalfLife or
>>> PriorityUsageResetPeriod as set in the slurm.conf. When this limit is
>>> reached all associated jobs running will be killed and all future jobs
>>> submitted with associations in the group will be delayed until they are
>>> able to run inside the limit.
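
The decay mentioned in the man page excerpt above can be illustrated with a plain half-life formula. This is a sketch under stated assumptions, not Slurm's actual code: the function name is hypothetical, the decay is applied in one closed-form step rather than incrementally, and no new usage is assumed to accrue during the elapsed interval.

```python
def decayed_usage(usage_minutes, elapsed_minutes, half_life_minutes):
    """Recorded TRES-minutes remaining after exponential half-life decay.

    Hypothetical helper: half_life_minutes stands in for the value of
    PriorityDecayHalfLife expressed in minutes (an assumption about units
    made for this example only).
    """
    if half_life_minutes <= 0:
        raise ValueError("half-life must be positive in this sketch")
    # Usage halves once for every half-life that has elapsed.
    return usage_minutes * 0.5 ** (elapsed_minutes / half_life_minutes)


# 1000 cpu-minutes of past usage, one half-life later.
print(decayed_usage(1000, 60, 60))  # 500.0
```

With decay in effect, usage recorded against a GrpTRESMins limit shrinks over time, so an association that has reached its limit can eventually run jobs again without the limit being raised.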
>>>
>>> Can you please confirm that you have configured the "Priority Multifactor"
>>> plugin?
>>>
>>> Your jobs should not be able to start if the user's GrpTRESMins has been
>>> exceeded. Hence they won't be killed!
>>>
>>> Can you explain step by step what you observe? It may be that the above
>>> documentation of killing jobs is in error, in which case we should make a
>>> bug report to SchedMD.
>