I’m somewhat confused by your statement "This would only occur if you lower the GrpTRESMins limit after a job has started." My test case had the limits established before job submission, and the job was terminated when the threshold was crossed.
Hoot

> On Apr 28, 2023, at 6:43 AM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
>
> Hi Hoot,
>
> I'm glad that you have figured out that GrpTRESMins is working as documented
> and kills running jobs when the limit is exceeded. This would only occur if
> you lower the GrpTRESMins limit after a job has started.
>
> /Ole
>
> On 4/27/23 22:39, Hoot Thompson wrote:
>> I have GrpTRESMins working and terminating jobs as expected. I was working
>> under the belief that the limit “current value” was only updated upon job
>> completion. That is not the case, it’s actually updated every 5 minutes it
>> appears. If and when the limit/threshold is crossed, jobs are in fact
>> canceled.
>> Thanks for your help.
>>> On Apr 24, 2023, at 1:55 PM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
>>>
>>> On 24-04-2023 18:33, Hoot Thompson wrote:
>>>> In my reading of the Slurm documentation, it seems that exceeding the
>>>> limits set in GrpTRESMins should result in terminating a running job.
>>>> However, in testing this, The ‘current value’ of the GrpTRESMins only
>>>> updates upon job completion and is not updated as the job progresses.
>>>> Therefore jobs aren’t being stopped. On the positive side, no new jobs are
>>>> started if the limit is exceeded. Here’s the documentation that is
>>>> confusing me…..
>>>
>>> I think the jobs resource usage will only be added to the Slurm database
>>> upon job completion. I believe that Slurm doesn't update the resource
>>> usage continually as you seem to expect.
>>>
>>>> If any limit is reached, all running jobs with that TRES in this group
>>>> will be killed, and no new jobs will be allowed to run.
>>>> Perhaps there is a setting or misconfiguration on my part.
>>>
>>> The sacctmgr manual page states:
>>>
>>>> GrpTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
>>>> The total number of TRES minutes that can possibly be used by past,
>>>> present and future jobs running from this association and its children.
>>>> To clear a previously set value use the modify command with a new value of
>>>> -1 for each TRES id.
>>>> NOTE: This limit is not enforced if set on the root association of a
>>>> cluster. So even though it may appear in sacctmgr output, it will not be
>>>> enforced.
>>>> ALSO NOTE: This limit only applies when using the Priority Multifactor
>>>> plugin. The time is decayed using the value of PriorityDecayHalfLife or
>>>> PriorityUsageResetPeriod as set in the slurm.conf. When this limit is
>>>> reached all associated jobs running will be killed and all future jobs
>>>> submitted with associations in the group will be delayed until they are
>>>> able to run inside the limit.
>>>
>>> Can you please confirm that you have configured the "Priority Multifactor"
>>> plugin?
>>>
>>> Your jobs should not be able to start if the user's GrpTRESMins has been
>>> exceeded. Hence they won't be killed!
>>>
>>> Can you explain step by step what you observe? It may be that the above
>>> documentation of killing jobs is in error, in which case we should make a
>>> bug report to SchedMD.
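
A toy model may help make the timing discussed in this thread concrete. The sketch below is not Slurm's implementation. It assumes the association's usage is re-evaluated on a fixed period (5 minutes, the interval reported above as an observation), that running jobs are killed once accumulated CPU-minutes reach the limit, and that usage can optionally decay with a half-life. The names `Job` and `simulate` and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int
    runtime_min: float  # how long the job would run if it were never killed


def simulate(limit_cpu_min, jobs, update_period_min=5.0, half_life_min=None):
    """Toy model of a GrpTRESMins-style limit (NOT Slurm's actual code).

    Assumptions: usage is refreshed every `update_period_min` minutes, all
    jobs start at t=0 with the limit already in place, and every still-running
    job is killed at the first refresh where usage >= limit.
    """
    usage = 0.0
    elapsed = 0.0
    remaining = {job.name: job.runtime_min for job in jobs}
    running = list(jobs)
    completed, killed = [], []

    while running:
        elapsed += update_period_min
        # Optional exponential decay of previously accumulated usage.
        if half_life_min is not None:
            usage *= 0.5 ** (update_period_min / half_life_min)
        # Charge each running job for the minutes it used in this period.
        for job in running:
            step = min(update_period_min, remaining[job.name])
            usage += job.cpus * step
            remaining[job.name] -= step
        completed += [j.name for j in running if remaining[j.name] <= 0]
        running = [j for j in running if remaining[j.name] > 0]
        # Enforce the limit on whatever is still running.
        if running and usage >= limit_cpu_min:
            killed += [j.name for j in running]
            running = []

    return {"elapsed_min": elapsed, "usage": usage,
            "completed": completed, "killed": killed}


if __name__ == "__main__":
    # A 4-CPU job against a 100 CPU-minute limit set before submission.
    print(simulate(100, [Job("long", cpus=4, runtime_min=60)]))
```

In this model a job that starts under the limit is still killed partway through its run, with no change to the limit after submission, which is the scenario described in the test case at the top of this message.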