Before you get too excited about it: we had a terrible time trying to get GPU metrics out of it. We finally abandoned it and switched to Grafana, Prometheus, and InfluxDB. Good luck to you, though.
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of "Heckes, Frank" <hec...@mps.mpg.de>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Wednesday, April 14, 2021 at 1:56 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] derived counters

Hi all,

many thanks for all the hints. The link in the latest post points to an impressive switchboard.

Cheers,
-Frank

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Renfro, Michael
Sent: Tuesday, 13 April 2021 19:25
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] derived counters

I'll never miss an opportunity to plug XDMoD for anyone who doesn't want to write custom analytics for every metric. I've managed to get a little way into its API to extract current values for the number of jobs completed and the number of CPU-hours provided, and to insert those into a single-slide presentation for introductory meetings. You can see a working version of it for the NSF XSEDE facilities at https://xdmod.ccr.buffalo.edu

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Hadrian Djohari <hx...@case.edu>
Date: Tuesday, April 13, 2021 at 8:11 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] derived counters

Hi Frank,

A way to get "how long jobs wait in the queue" is to import the data into XDMoD (https://open.xdmod.org/9.0/index.html). The nifty reporting tool has many features that make it easier for us to report on cluster usage.

Hadrian

On Tue, Apr 13, 2021 at 8:08 AM Heckes, Frank <hec...@mps.mpg.de> wrote:

Hello Ole,

> >> -----Original Message-----
> >>> * (average) queue length for a certain partition
>
> I wonder what exactly your question means? Maybe the number of jobs or
> CPUs in the Pending state? Maybe relative to the number of CPUs in the
> partition?

This results from a management question: how long do jobs have to wait (in seconds, minutes, hours, days) before they get executed, and how many jobs are waiting (queued) in each partition during a certain time interval. The first one is easy to find with sacct and the submit and start timestamps: take the difference and average it. The second is a bit cumbersome, so I wonder whether a 'solution' is already around. The easiest way would have been to monitor from the beginning and store the squeue output for later evaluation; unfortunately I didn't do that.

Cheers,
-Frank
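
A minimal Python sketch of the sacct-based averaging described just above. The reporting window and the hours conversion are illustrative assumptions, not values from the thread; the field names are the standard ones listed by `sacct --helpformat`.

#!/usr/bin/env python3
# Average queue wait per partition from sacct Submit/Start timestamps.
# The -S/-E window below is only an example; adjust to the interval of interest.
import subprocess
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"  # sacct's default timestamp format

out = subprocess.run(
    ["sacct", "-a", "-X", "-n", "-P",
     "-S", "2021-04-01", "-E", "2021-04-14",
     "-o", "Partition,Submit,Start"],
    capture_output=True, text=True, check=True).stdout

waits = defaultdict(list)
for line in out.splitlines():
    partition, submit, start = line.split("|")
    if start in ("Unknown", "None", ""):  # job never started in the window
        continue
    delta = datetime.strptime(start, FMT) - datetime.strptime(submit, FMT)
    waits[partition].append(delta.total_seconds())

for partition, secs in sorted(waits.items()):
    print(f"{partition:15s} avg wait {sum(secs)/len(secs)/3600:.2f} h ({len(secs)} jobs)")
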
> The "slurmacct" command prints (possibly for a specified partition) the
> average job waiting time while Pending in the queue, but not the queue-length
> information.
>
> It may be difficult to answer your question from the Slurm database. The
> sacct command displays accounting data for all jobs and job steps, but not
> directly for partitions.
>
> There are other Slurm monitoring tools which can perhaps supply the data you
> are looking for. You could ask this list again.
>
> /Ole

--
Hadrian Djohari
Manager of Research Computing Services, [U]Tech
Case Western Reserve University
(W): 216-368-0395  (M): 216-798-7490
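
For the second metric (queue length per partition over time), a small sketch of the "store the squeue output for later evaluation" idea from Frank's mail. The log path and the cron interval are placeholders, not anything from the thread.

#!/usr/bin/env python3
# Snapshot the number of pending jobs per partition; run periodically
# (e.g. every 5 minutes from cron) and evaluate the resulting log afterwards.
import subprocess
from collections import Counter
from datetime import datetime, timezone

out = subprocess.run(
    ["squeue", "-h", "-t", "PENDING", "-o", "%P"],  # one partition name per pending job
    capture_output=True, text=True, check=True).stdout

pending = Counter(out.split())
now = datetime.now(timezone.utc).isoformat(timespec="seconds")

with open("/var/log/slurm/queue_depth.log", "a") as log:  # assumed path
    for partition, count in sorted(pending.items()):
        log.write(f"{now} {partition} {count}\n")

A crontab entry along the lines of "*/5 * * * * /usr/local/sbin/queue_depth.py" (path assumed) then yields a time series that can be averaged per partition and per interval afterwards.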