I have GrpTRESMins working and terminating jobs as expected. I was working under the belief that the limit's “current value” was only updated upon job completion. That is not the case; it appears to be updated every 5 minutes. If and when the limit is crossed, running jobs are in fact canceled.
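For anyone who wants to estimate when a running job would be canceled under this behavior, here is a rough Python sketch. It is a simplified model, not Slurm's actual implementation: it assumes usage grows linearly with the number of allocated TRES (e.g. CPUs), that the running total is compared against the limit once per check period, and that no decay (PriorityDecayHalfLife) is applied. The 5-minute period is what I observed; it may correspond to PriorityCalcPeriod in slurm.conf (default 5 minutes), but I have not confirmed that. The function name and parameters are hypothetical.

```python
import math


def minutes_until_cancel(limit_tres_min, used_tres_min, tres_count,
                         check_period_min=5):
    """Estimate the elapsed minutes at which a periodic usage check would
    first see accumulated TRES-minutes at or above the limit.

    Assumptions (not taken from Slurm source): linear accrual, no decay,
    and a check every `check_period_min` minutes starting from job start.
    """
    if tres_count <= 0:
        raise ValueError("tres_count must be positive")
    if check_period_min <= 0:
        raise ValueError("check_period_min must be positive")

    remaining = limit_tres_min - used_tres_min
    if remaining <= 0:
        # Already at or over the limit; under this model the job
        # would not be allowed to start at all.
        return 0

    # Exact moment the running total crosses the limit.
    crossing = remaining / tres_count

    # The crossing is only noticed at the next periodic check.
    checks = math.ceil(crossing / check_period_min)
    return checks * check_period_min


# 16 CPUs against a 1000 CPU-minute limit: the limit is crossed at
# 62.5 minutes and noticed at the next 5-minute check.
print(minutes_until_cancel(1000, 0, 16))  # 65
```

Under these assumptions a job can overshoot the limit by up to one check period before it is canceled.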
Thanks for your help.

> On Apr 24, 2023, at 1:55 PM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
>
> On 24-04-2023 18:33, Hoot Thompson wrote:
>> In my reading of the Slurm documentation, it seems that exceeding the limits
>> set in GrpTRESMins should result in terminating a running job. However, in
>> testing this, The ‘current value’ of the GrpTRESMins only updates upon job
>> completion and is not updated as the job progresses. Therefore jobs aren’t
>> being stopped. On the positive side, no new jobs are started if the limit is
>> exceeded. Here’s the documentation that is confusing me…..
>
> I think the jobs resource usage will only be added to the Slurm database upon
> job completion. I believe that Slurm doesn't update the resource usage
> continually as you seem to expect.
>
>> If any limit is reached, all running jobs with that TRES in this group will
>> be killed, and no new jobs will be allowed to run.
>> Perhaps there is a setting or misconfiguration on my part.
>
> The sacctmgr manual page states:
>
>> GrpTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
>> The total number of TRES minutes that can possibly be used by past, present
>> and future jobs running from this association and its children. To clear a
>> previously set value use the modify command with a new value of -1 for each
>> TRES id.
>> NOTE: This limit is not enforced if set on the root association of a
>> cluster. So even though it may appear in sacctmgr output, it will not be
>> enforced.
>> ALSO NOTE: This limit only applies when using the Priority Multifactor
>> plugin. The time is decayed using the value of PriorityDecayHalfLife or
>> PriorityUsageResetPeriod as set in the slurm.conf. When this limit is
>> reached all associated jobs running will be killed and all future jobs
>> submitted with associations in the group will be delayed until they are able
>> to run inside the limit.
>
> Can you please confirm that you have configured the "Priority Multifactor"
> plugin?
>
> Your jobs should not be able to start if the user's GrpTRESMins has been
> exceeded. Hence they won't be killed!
>
> Can you explain step by step what you observe? It may be that the above
> documentation of killing jobs is in error, in which case we should make a bug
> report to SchedMD.
>
> /Ole