Resolved now. On older versions of Slurm, I could have partitions without default times specified (just an upper time limit, in my case). As of Slurm 18 or 19, I had to add a default time to all my partitions to keep jobs from being held with the AssocGrpCPURunMinutesLimit pending reason.
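For anyone hitting the same thing, the fix amounts to adding a DefaultTime alongside MaxTime on each partition definition in slurm.conf. The partition names and node ranges below are hypothetical, and the exact interaction between a missing DefaultTime and the GrpTRESRunMin check isn't spelled out anywhere I've found; this is just the shape of the change that worked for me:

=====

# slurm.conf -- hypothetical partition definitions

# Before: only MaxTime set; jobs sat pending with AssocGrpCPURunMinutesLimit
#PartitionName=interactive Nodes=node[001-040] MaxTime=02:00:00 State=UP

# After: explicit DefaultTime added to every partition
PartitionName=interactive Nodes=node[001-040] DefaultTime=02:00:00 MaxTime=02:00:00 State=UP
PartitionName=batch Nodes=node[001-040] DefaultTime=1-00:00:00 MaxTime=30-00:00:00 State=UP

=====

After editing slurm.conf, an `scontrol reconfigure` (or a slurmctld restart) picks up the change; `scontrol show partition` will then report the DefaultTime on each partition.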
> On Dec 16, 2019, at 2:00 PM, Renfro, Michael <ren...@tntech.edu> wrote:
>
> Thanks, Ole. I forgot I had that tool already. I'm not seeing where the
> limits are getting enforced, but I've now narrowed it down to some of my
> partitions or my job routing Lua plugin:
>
> =====
>
> [renfro@login ~]$ hpcshell --reservation=slurm-upgrade --partition=interactive
> srun: job 232423 queued and waiting for resources
> ^Csrun: Job allocation 232423 has been revoked
> srun: Force Terminated job 232423
> [renfro@login ~]$ hpcshell --reservation=slurm-upgrade --partition=batch
> [renfro@node001(job 232424) ~]$ exit
> [renfro@login ~]$
>
> =====
>
> =====
>
> JobId=232423 UserId=renfro(177805483) GroupId=domain users(177800513)
> Name=bash JobState=CANCELLED Partition=any-interactive TimeLimit=120
> StartTime=2019-12-16T13:58:59 EndTime=2019-12-16T13:58:59 NodeList=(null)
> NodeCnt=0 ProcCnt=1 WorkDir=/home/tntech.edu/renfro
> ReservationName=slurm-upgrade Gres= Account=hpcadmins QOS=normal WcKey=
> Cluster=its SubmitTime=2019-12-16T13:58:56 EligibleTime=2019-12-16T13:58:56
> DerivedExitCode=0:0 ExitCode=0:0
>
> JobId=232424 UserId=renfro(177805483) GroupId=domain users(177800513)
> Name=bash JobState=COMPLETED Partition=batch TimeLimit=1440
> StartTime=2019-12-16T13:59:02 EndTime=2019-12-16T13:59:20 NodeList=node001
> NodeCnt=1 ProcCnt=1 WorkDir=/home/tntech.edu/renfro
> ReservationName=slurm-upgrade Gres= Account=hpcadmins QOS=normal WcKey=
> Cluster=its SubmitTime=2019-12-16T13:59:02 EligibleTime=2019-12-16T13:59:02
> DerivedExitCode=0:0 ExitCode=0:0
>
> =====
>
>> On Dec 16, 2019, at 1:03 PM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
>>
>> Hi Mike,
>>
>> My showuserlimits tool prints user limits from the Slurm database nicely:
>> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits
>>
>> Maybe this can give you further insight into the source of the problem.
>>
>> /Ole
>>
>> On 16-12-2019 17:27, Renfro, Michael wrote:
>>> Hey, folks. I’ve just upgraded from Slurm 17.02 (way behind schedule, I
>>> know) to 19.05. The only thing I’ve noticed going wrong is that my user
>>> resource limits aren’t being applied correctly.
>>>
>>> My typical user has a GrpTRESRunMin limit of cpu=1440000 (1000 CPU-days),
>>> and after the upgrade, that limit appears to be blocking jobs even when I’m
>>> only requesting a very small amount of resources (2 CPU-hours).
>>>
>>> With no limits, the job runs fine:
>>>
>>> =====
>>>
>>> [root@login ~]# squeue -u renfro
>>>   JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST(REASON)
>>> [root@login ~]# sacctmgr modify user renfro set grptresrunmin=cpu=-1
>>>
>>> [renfro@login ~]$ hpcshell --reservation=slurm-upgrade
>>> [renfro@gpunode001(job 232393) ~]$ exit
>>>
>>> =====
>>>
>>> With the 1000 CPU-day limit, a 2 CPU-hour job is permanently pending:
>>>
>>> =====
>>>
>>> [root@login ~]# sacctmgr modify user renfro set grptresrunmin=cpu=1440000
>>>
>>> [renfro@login ~]$ hpcshell --reservation=slurm-upgrade
>>> srun: job 232394 queued and waiting for resources
>>>
>>> [root@login ~]# scontrol show job 232394
>>> JobId=232394 JobName=bash
>>>  UserId=renfro(177805483) GroupId=domain users(177800513) MCS_label=N/A
>>>  Priority=99249 Nice=0 Account=hpcadmins QOS=normal
>>>  JobState=PENDING Reason=AssocGrpCPURunMinutesLimit Dependency=(null)
>>>  Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>>>  RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
>>>  SubmitTime=2019-12-16T10:22:38 EligibleTime=2019-12-16T10:22:38
>>>  AccrueTime=2019-12-16T10:22:38
>>>  StartTime=Unknown EndTime=Unknown Deadline=N/A
>>>  SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-12-16T10:22:43
>>>  Partition=any-interactive AllocNode:Sid=login.hpc.tntech.edu:74850
>>>  ReqNodeList=(null) ExcNodeList=(null)
>>>  NodeList=(null)
>>>  NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>>  TRES=cpu=1,mem=2000M,node=1,billing=1
>>>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>>  MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
>>>  Features=(null) DelayBoot=00:00:00
>>>  Reservation=slurm-upgrade
>>>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>>>  Command=bash
>>>  WorkDir=/home/tntech.edu/renfro
>>>  Power=
>>>
>>> =====
>>>
>>> No other jobs under the hpcadmins account are running or queued. Any ideas
>>> on what might be going on? Thanks for any help provided.