Thanks, Ole. I forgot I had that tool already. I’m not seeing where the limits are getting enforced, but I’ve now narrowed it down to either some of my partitions or my job routing Lua plugin:

=====

[renfro@login ~]$ hpcshell --reservation=slurm-upgrade --partition=interactive
srun: job 232423 queued and waiting for resources
^Csrun: Job allocation 232423 has been revoked
srun: Force Terminated job 232423
[renfro@login ~]$ hpcshell --reservation=slurm-upgrade --partition=batch
[renfro@node001(job 232424) ~]$ exit
[renfro@login ~]$

=====

=====

JobId=232423 UserId=renfro(177805483) GroupId=domain users(177800513) Name=bash JobState=CANCELLED Partition=any-interactive TimeLimit=120 StartTime=2019-12-16T13:58:59 EndTime=2019-12-16T13:58:59 NodeList=(null) NodeCnt=0 ProcCnt=1 WorkDir=/home/tntech.edu/renfro ReservationName=slurm-upgrade Gres= Account=hpcadmins QOS=normal WcKey= Cluster=its SubmitTime=2019-12-16T13:58:56 EligibleTime=2019-12-16T13:58:56 DerivedExitCode=0:0 ExitCode=0:0
JobId=232424 UserId=renfro(177805483) GroupId=domain users(177800513) Name=bash JobState=COMPLETED Partition=batch TimeLimit=1440 StartTime=2019-12-16T13:59:02 EndTime=2019-12-16T13:59:20 NodeList=node001 NodeCnt=1 ProcCnt=1 WorkDir=/home/tntech.edu/renfro ReservationName=slurm-upgrade Gres= Account=hpcadmins QOS=normal WcKey= Cluster=its SubmitTime=2019-12-16T13:59:02 EligibleTime=2019-12-16T13:59:02 DerivedExitCode=0:0 ExitCode=0:0

=====

> On Dec 16, 2019, at 1:03 PM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
>
> External Email Warning
>
> This email originated from outside the university. Please use caution when
> opening attachments, clicking links, or responding to requests.
>
> ________________________________
>
> Hi Mike,
>
> My showuserlimits tool prints nicely user limits from the Slurm database:
> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits
>
> Maybe this can give you further insights into the source of problems.
>
> /Ole
>
> On 16-12-2019 17:27, Renfro, Michael wrote:
>> Hey, folks. I’ve just upgraded from Slurm 17.02 (way behind schedule, I
>> know) to 19.05. The only thing I’ve noticed going wrong is that my user
>> resource limits aren’t being applied correctly.
>>
>> My typical user has a GrpTRESRunMin limit of cpu=1440000 (1000 CPU-days),
>> and after the upgrade, it appears that limit is blocking jobs even when I’m
>> only requesting a very small amount of resources (2 CPU-hours).
>>
>> With no limits, job runs fine:
>>
>> =====
>>
>> [root@login ~]# squeue -u renfro
>> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>> [root@login ~]# sacctmgr modify user renfro set grptresrunmin=cpu=-1
>>
>> [renfro@login ~]$ hpcshell --reservation=slurm-upgrade
>> [renfro@gpunode001(job 232393) ~]$ exit
>>
>> =====
>>
>> With the 1000 CPU-days limit, a 2 CPU-hour jobs is permanently pending:
>>
>> =====
>>
>> [root@login ~]# sacctmgr modify user renfro set grptresrunmin=cpu=1440000
>>
>> [renfro@login ~]$ hpcshell --reservation=slurm-upgrade
>> srun: job 232394 queued and waiting for resources
>>
>> [root@login ~]# scontrol show job 232394
>> JobId=232394 JobName=bash
>>    UserId=renfro(177805483) GroupId=domain users(177800513) MCS_label=N/A
>>    Priority=99249 Nice=0 Account=hpcadmins QOS=normal
>>    JobState=PENDING Reason=AssocGrpCPURunMinutesLimit Dependency=(null)
>>    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>>    RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
>>    SubmitTime=2019-12-16T10:22:38 EligibleTime=2019-12-16T10:22:38
>>    AccrueTime=2019-12-16T10:22:38
>>    StartTime=Unknown EndTime=Unknown Deadline=N/A
>>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-12-16T10:22:43
>>    Partition=any-interactive AllocNode:Sid=login.hpc.tntech.edu:74850
>>    ReqNodeList=(null) ExcNodeList=(null)
>>    NodeList=(null)
>>    NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>    TRES=cpu=1,mem=2000M,node=1,billing=1
>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>    MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
>>    Features=(null) DelayBoot=00:00:00
>>    Reservation=slurm-upgrade
>>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>>    Command=bash
>>    WorkDir=/home/tntech.edu/renfro
>>    Power=
>>
>> =====
>>
>> No other jobs under the hpcadmins account are running or queued. Any ideas
>> on what might be going on? Thanks for any help provided.
>
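
For anyone checking the numbers in the quoted message, here is a rough sketch of the arithmetic. It assumes GrpTRESRunMin is charged as allocated CPUs multiplied by time-limit minutes, summed over the association's running jobs; that is a simplified reading of the limit, not something verified against the 19.05 source. The function names are hypothetical, and the values are copied from the output above.

```python
# Values copied from the thread; the accounting rule is an assumption.
LIMIT_CPU_MINUTES = 1_440_000    # GrpTRESRunMin cpu=1440000
JOB_CPUS = 1                     # NumCPUs=1 for job 232394
JOB_TIME_LIMIT_MINUTES = 120     # TimeLimit=02:00:00


def cpu_run_minutes(cpus: int, minutes: int) -> int:
    """CPU-minutes a job would count against the limit (assumed rule)."""
    return cpus * minutes


def fits_under_limit(requested: int, already_running: int, limit: int) -> bool:
    """True if the new job plus current usage stays within the limit."""
    return already_running + requested <= limit


requested = cpu_run_minutes(JOB_CPUS, JOB_TIME_LIMIT_MINUTES)
limit_cpu_days = LIMIT_CPU_MINUTES / (60 * 24)

print(f"limit: {LIMIT_CPU_MINUTES} CPU-minutes = {limit_cpu_days:g} CPU-days")
print(f"job:   {requested} CPU-minutes = {requested / 60:g} CPU-hours")
# The thread states no other hpcadmins jobs are running, so usage is 0.
print("fits under limit:", fits_under_limit(requested, 0, LIMIT_CPU_MINUTES))
```

Under that assumption the job is far below the limit, so the AssocGrpCPURunMinutesLimit reason is not explained by the requested resources alone.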