Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

Ole Holm Nielsen Fri, 08 May 2020 11:29:08 -0700

Hi Michael,

Yes, my Slurm tools use and trust the output of Slurm commands such as sacct, and any discrepancy would have to come from the Slurm database. Which version of Slurm are you running on the database server and the node where you run sacct?

Did you add up the GrpTRESRunMins values of all the user's running jobs? They had better add up to current value = 1402415. The "showjob" command prints #CPUs and time limit in minutes, so you need to multiply these numbers together. Example:

This job requests 160 CPUs and has a time limit of 2-00:00:00 (days-hh:mm:ss) = 2880 min.
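The multiplication for that example job can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the helper name slurm_time_to_minutes is hypothetical and not part of showjob or Slurm, and it handles only the [days-]hh:mm:ss and mm:ss forms.

```python
def slurm_time_to_minutes(timestr: str) -> float:
    """Convert a time string of the form [days-]hh:mm:ss or mm:ss to minutes.

    Hypothetical helper for illustration; other Slurm time formats
    (for example "days-hours") are deliberately not handled.
    """
    days = 0
    if "-" in timestr:
        day_part, timestr = timestr.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in timestr.split(":")]
    # Pad on the left so that "mm:ss" is treated like "0:mm:ss"
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return days * 1440 + hours * 60 + minutes + seconds / 60


# The example job: 160 CPUs with a time limit of 2-00:00:00
cpus = 160
limit_minutes = slurm_time_to_minutes("2-00:00:00")
cpu_minutes = cpus * limit_minutes
print(f"{limit_minutes:.0f} min x {cpus} CPUs = {cpu_minutes:.0f} CPU-minutes")
```

For this job that is 2880 min x 160 CPUs = 460800 CPU-minutes, which is the per-job number to add up across the user's running jobs.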

Did you download the latest versions of my Slurm tools from GitHub? I make improvements to them from time to time.


/Ole


On 08-05-2020 16:12, Renfro, Michael wrote:

Thanks, Ole. Your showuserlimits script is actually where I got started today, and where I found the sacct command I sent earlier.

Your script gives the same output for that user: the only line that's not a "Limit = None" is for the user's GrpTRESRunMins value, which is at "Limit = 1440000, current value = 1402415".

The limit value is correct, but the current value is not (due to the incorrect sacct output).

I've also gone through sacctmgr show runaway to clean up any runaway jobs. I had lots, but they were all from a different user, and had no effect on this particular user's values.


------------------------------------------------------------------------

*From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>

*Sent:* Friday, May 8, 2020 8:54 AM
*To:* slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>

*Subject:* Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue


Hi Michael,

Maybe you will find a couple of my Slurm tools useful for displaying
data from the Slurm database in a more user-friendly format:

showjob: Show status of Slurm job(s). Both queue information and
accounting information is printed.

showuserlimits: Print Slurm resource user limits and usage

The user's limits are printed in detail by showuserlimits.

These tools are available from https://github.com/OleHolmNielsen/Slurm_tools

/Ole

On 08-05-2020 15:34, Renfro, Michael wrote:

Hey, folks. I've had a 1000 CPU-day (1440000 CPU-minutes) GrpTRESRunMins
limit applied to each user for years. It generally works as intended,
but I have one user I've noticed whose usage is highly inflated from
reality, causing the GrpTRESRunMins limit to be enforced much earlier than
necessary:

squeue output, showing roughly 340 CPU-days in running jobs, and all
other jobs blocked:

# squeue -u USER
JOBID  PARTI NAME USER ST TIME       CPUS NODES NODELIST(REASON) PRIORITY TRES_P START_TIME          TIME_LEFT
747436 batch job  USER PD 0:00       28   1     (AssocGrpCPURunM 4784     N/A    N/A                 10-00:00:00
747437 batch job  USER PD 0:00       28   1     (AssocGrpCPURunM 4784     N/A    N/A                 4-04:00:00
747438 batch job  USER PD 0:00       28   1     (AssocGrpCPURunM 4784     N/A    N/A                 10-00:00:00
747439 batch job  USER PD 0:00       28   1     (AssocGrpCPURunM 4784     N/A    N/A                 4-04:00:00
747440 batch job  USER PD 0:00       28   1     (AssocGrpCPURunM 4784     N/A    N/A                 10-00:00:00
747441 batch job  USER PD 0:00       28   1     (AssocGrpCPURunM 4784     N/A    N/A                 4-14:00:00
747442 batch job  USER PD 0:00       28   1     (AssocGrpCPURunM 4784     N/A    N/A                 10-00:00:00
747446 batch job  USER PD 0:00       14   1     (AssocGrpCPURunM 4778     N/A    N/A                 4-00:00:00
747447 batch job  USER PD 0:00       14   1     (AssocGrpCPURunM 4778     N/A    N/A                 4-00:00:00
747448 batch job  USER PD 0:00       14   1     (AssocGrpCPURunM 4778     N/A    N/A                 4-00:00:00
747445 batch job  USER R  8:39:17    14   1     node002          4778     N/A    2020-05-07T23:02:19 3-15:20:43
747444 batch job  USER R  16:03:13   14   1     node003          4515     N/A    2020-05-07T15:38:23 3-07:56:47
747435 batch job  USER R  1-10:07:42 28   1     node005          3784     N/A    2020-05-06T21:33:54 8-13:52:18

scontrol output, showing roughly 980 CPU-days in use on the second line,
and thus blocking additional jobs:

# scontrol -o show assoc_mgr users=USER account=ACCOUNT flags=assoc
ClusterName=its Account=ACCOUNT UserName= Partition= Priority=0 ID=21
SharesRaw/Norm/Level/Factor=1/0.03/35/0.00
UsageRaw/Norm/Efctv=2733615872.34/0.39/0.71 ParentAccount=PARENT(9)
Lft=1197 DefAssoc=No GrpJobs=N(4) GrpJobsAccrue=N(10)
GrpSubmitJobs=N(14) GrpWall=N(616142.94)
GrpTRES=cpu=N(84),mem=N(168000),energy=N(0),node=N(40),billing=N(420),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
GrpTRESMins=cpu=N(9239391),mem=N(18478778157),energy=N(0),node=N(616142),billing=N(45546470),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
GrpTRESRunMins=cpu=N(1890060),mem=N(3780121866),energy=N(0),node=N(113778),billing=N(9450304),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN=
MaxTRESMinsPJ= MinPrioThresh=
ClusterName=its Account=ACCOUNT UserName=USER(UID) Partition= Priority=0
ID=56 SharesRaw/Norm/Level/Factor=1/0.08/13/0.00
UsageRaw/Norm/Efctv=994969457.37/0.14/0.36 ParentAccount= Lft=1218
DefAssoc=Yes GrpJobs=N(3) GrpJobsAccrue=N(10) GrpSubmitJobs=N(13)
GrpWall=N(227625.69)
GrpTRES=cpu=N(56),mem=N(112000),energy=N(0),node=N(35),billing=N(280),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=8(0)
GrpTRESMins=cpu=N(3346095),mem=N(6692190572),energy=N(0),node=N(227625),billing=N(16580497),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
GrpTRESRunMins=cpu=1440000(1407455),mem=N(2814910466),energy=N(0),node=N(88171),billing=N(7037276),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN=
MaxTRESMinsPJ= MinPrioThresh=

Where can I investigate to find the cause of this difference? Thanks.
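
For reference, the two figures quoted above can be reproduced with a short Python sketch. It assumes that the usage counted for a running job is CPUS times TIME_LEFT, which is how the 340 CPU-day estimate appears to have been derived; the helper name to_minutes is hypothetical, and the job data is copied by hand from the squeue and scontrol output in this message.

```python
def to_minutes(timestr: str) -> float:
    """Convert squeue's [days-]hh:mm:ss to minutes (hypothetical helper)."""
    days, _, clock = timestr.rpartition("-")
    hours, minutes, seconds = (int(p) for p in clock.split(":"))
    return int(days or 0) * 1440 + hours * 60 + minutes + seconds / 60


# (job id, CPUS, TIME_LEFT) for the three running jobs in the squeue output
running_jobs = [
    (747445, 14, "3-15:20:43"),
    (747444, 14, "3-07:56:47"),
    (747435, 28, "8-13:52:18"),
]

# Assumption: each running job counts as CPUS * TIME_LEFT
queue_cpu_minutes = sum(cpus * to_minutes(left) for _, cpus, left in running_jobs)

# Values copied from the GrpTRESRunMins field of the user's association
limit_cpu_minutes = 1440000
assoc_cpu_minutes = 1407455

# Smallest pending job: 14 CPUs with 4-00:00:00 remaining
smallest_pending = 14 * to_minutes("4-00:00:00")

print(f"squeue estimate : {queue_cpu_minutes / 1440:.1f} CPU-days")
print(f"assoc_mgr value : {assoc_cpu_minutes / 1440:.1f} CPU-days")
print(f"fits by squeue estimate: {queue_cpu_minutes + smallest_pending <= limit_cpu_minutes}")
print(f"fits by assoc_mgr value: {assoc_cpu_minutes + smallest_pending <= limit_cpu_minutes}")
```

Under that assumption the running jobs come to about 338 CPU-days, against roughly 977 CPU-days recorded in the association, so even the smallest pending job (80640 CPU-minutes) is held back by the recorded value although it would fit comfortably by the squeue estimate.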
