Hi, Xand:

How does adding "ReqMem" to the sacct format change the output?
E.g. on my cluster running Slurm 20.02.7 (on RHEL8), our GPU nodes have
TRESBillingWeights=CPU=0,Mem=0,GRES/gpu=43:

$ sacct --format=JobID%25,State,AllocTRES%50,ReqTRES,ReqMem,ReqCPUS|grep RUNNING
                    JobID      State                                          AllocTRES    ReqTRES     ReqMem  ReqCPUS
------------------------- ---------- -------------------------------------------------- ---------- ---------- --------
            2512977.batch    RUNNING                                cpu=48,mem=0,node=1                    0n       48
           2512977.extern    RUNNING             billing=516,cpu=144,gres/gpu=12,node=3                    0n      144
                2512977.0    RUNNING     cpu=24,gres/gpu:v100=8,gres/gpu=8,mem=0,node=2                    0n       24
                  2513020    RUNNING             billing=516,cpu=144,gres/gpu=12,node=3 billing=5+         0n      144

Note the "mem=0" and the absence of the mem field on some of those lines. In squeue:

  JOBID PART             NAME     USER    STATE     TIME TIME_LIMIT NODES MIN_MEMO NODELIST(REASON)
2512977  gpu 1AB_96DMPCLoose_    ba553  RUNNING 22:29:20 1-00:00:00     3        0 gpu[001,003-004]

In comparison, here is a job on our def partition which requests a specific amount of mem (sacct):

                    JobID      State                                          AllocTRES    ReqTRES     ReqMem  ReqCPUS
------------------------- ---------- -------------------------------------------------- ---------- ---------- --------
                  2514854    RUNNING                     billing=1,cpu=1,mem=36G,node=1 billing=1+       36Gn        1
            2514854.batch    RUNNING                               cpu=1,mem=36G,node=1                  36Gn        1
           2514854.extern    RUNNING                     billing=1,cpu=1,mem=36G,node=1                  36Gn        1

and the squeue line:

  JOBID PART             NAME     USER    STATE    TIME TIME_LIMIT NODES MIN_MEMO NODELIST(REASON)
2514854  def ClusterJobStart_ sbradley  RUNNING 5:05:27    8:00:00     1      36G node003

--
David Chin, PhD (he/him)
Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu          215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Xand Meaden <xand.mea...@kcl.ac.uk>
Sent: Wednesday, January 12, 2022 12:23
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Memory usage not tracked

Hi,

We wish to record the memory usage of HPC jobs, but with Slurm 20.11 we cannot get this to work - the information is simply missing. Our two older clusters with Slurm 19.05 record memory usage as a TRES, e.g. as shown below:

# sacct --format=JobID,State,AllocTRES%32|grep RUNNING|head -4
14029267       RUNNING  billing=32,cpu=32,mem=185600M,n+
14037739       RUNNING  billing=64,cpu=64,mem=250G,node+
14037739.ba+   RUNNING  cpu=32,mem=125G,node=1
14037739.0     RUNNING  cpu=1,mem=4000M,node=1

However, with 20.11 we see no memory usage:

# sacct --format=JobID,State,AllocTRES%32|grep RUNNING|head -4
771            RUNNING  billing=36,cpu=36,node=1
771.batch      RUNNING  cpu=36,mem=0,node=1
816            RUNNING  billing=128,cpu=128,node=1
823            RUNNING  billing=36,cpu=36,node=1

I've also checked within the Slurm database's cluster_job_table, and tres_alloc has no "2=" (memory) value for any job. From my reading of https://slurm.schedmd.com/tres.html it's not possible to disable memory as a TRES, so I can't figure out what I'm missing here.

The 20.11 cluster is running on Ubuntu 20.04 (vs CentOS 7 for the others), in case that makes any difference!

Thanks in advance,
Xand
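Side note (not part of the original exchange): when AllocTRES shows mem=0 or omits mem entirely, the usual things to check are whether memory is configured as a consumable resource for the select plugin and whether jobs that request no memory receive a default allocation. A rough diagnostic sketch follows, assuming a standard Slurm installation; the parameters named are standard slurm.conf options, but whether any of them is the cause on these particular clusters is only an assumption.

# Is memory part of the consumable-resource selection (e.g. CR_Core_Memory)?
scontrol show config | grep -i SelectTypeParameters

# Do jobs that request no memory get a default (or capped) allocation?
scontrol show config | grep -iE 'DefMemPer|MaxMemPer'

# Which TRES does the accounting database actually track?
sacctmgr show tres

# Per-partition memory defaults, if set there instead of globally
scontrol show partition | grep -i DefMemPer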