Before you get too excited about it: we had a terrible time trying to get GPU metrics out of it. We finally abandoned it and switched to Grafana, Prometheus, and InfluxDB. Good luck to you, though.
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of "Heckes, Frank" <hec...@mps.mpg.de>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Wednesday, April 14, 2021 at 1:56 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] derived counters

Hi all,

many thanks for all the hints. The link in the latest post points to an impressive switchboard.

Cheers,
-Frank

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Renfro, Michael
Sent: Tuesday, 13 April 2021 19:25
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] derived counters

I'll never miss an opportunity to plug XDMoD for anyone who doesn't want to write custom analytics for every metric. I've managed to get a little way into its API to extract current values for the number of jobs completed and the number of CPU-hours provided, and to insert those into a single-slide presentation for introductory meetings. You can see a working version of it for the NSF XSEDE facilities at https://xdmod.ccr.buffalo.edu

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Hadrian Djohari <hx...@case.edu>
Date: Tuesday, April 13, 2021 at 8:11 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] derived counters

Hi Frank,

A way to get "how long jobs wait in the queue" is to import the data into XDMoD (https://open.xdmod.org/9.0/index.html). The nifty reporting tool has many features that make it easier for us to report on cluster usage.

Hadrian

On Tue, Apr 13, 2021 at 8:08 AM Heckes, Frank <hec...@mps.mpg.de> wrote:

Hello Ole,

> >> -----Original Message-----
> >>> * (average) queue length for a certain partition
>
> I wonder what exactly your question means? Maybe the number of jobs or
> CPUs in the Pending state? Maybe relative to the number of CPUs in the
> partition?

This results from a management question: how long do jobs have to wait (in seconds, minutes, hours, days) before they get executed, and how many jobs are waiting (queued) in each partition during a certain time interval. The first one is easy to find with sacct and the submit and start timestamps: take the difference and average it. The second is a bit cumbersome, so I wonder whether a 'solution' is already around. The easiest way would have been to monitor from the beginning and store the squeue output for later evaluation; unfortunately I didn't do that.

Cheers,
-Frank
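
A minimal Python sketch of the sacct-based averaging described just above. The reporting window and the hours conversion are illustrative assumptions, not values from the thread; the field names are the standard ones listed by `sacct --helpformat`.

#!/usr/bin/env python3
# Average queue wait per partition from sacct Submit/Start timestamps.
# The -S/-E window below is only an example; adjust to the interval of interest.
import subprocess
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"  # sacct's default timestamp format

out = subprocess.run(
    ["sacct", "-a", "-X", "-n", "-P",
     "-S", "2021-04-01", "-E", "2021-04-14",
     "-o", "Partition,Submit,Start"],
    capture_output=True, text=True, check=True).stdout

waits = defaultdict(list)
for line in out.splitlines():
    partition, submit, start = line.split("|")
    if start in ("Unknown", "None", ""):  # job never started in the window
        continue
    delta = datetime.strptime(start, FMT) - datetime.strptime(submit, FMT)
    waits[partition].append(delta.total_seconds())

for partition, secs in sorted(waits.items()):
    print(f"{partition:15s} avg wait {sum(secs)/len(secs)/3600:.2f} h ({len(secs)} jobs)")
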
> The "slurmacct" command prints (possibly for a specified partition) the
> average job waiting time while Pending in the queue, but not the queue-length
> information.
>
> It may be difficult to answer your question from the Slurm database. The
> sacct command displays accounting data for all jobs and job steps, but not
> directly for partitions.
>
> There are other Slurm monitoring tools which can perhaps supply the data you
> are looking for. You could ask this list again.
>
> /Ole

--
Hadrian Djohari
Manager of Research Computing Services, [U]Tech
Case Western Reserve University
(W): 216-368-0395  (M): 216-798-7490
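
For the second metric (queue length per partition over time), a small sketch of the "store the squeue output for later evaluation" idea from Frank's mail. The log path and the cron interval are placeholders, not anything from the thread.

#!/usr/bin/env python3
# Snapshot the number of pending jobs per partition; run periodically
# (e.g. every 5 minutes from cron) and evaluate the resulting log afterwards.
import subprocess
from collections import Counter
from datetime import datetime, timezone

out = subprocess.run(
    ["squeue", "-h", "-t", "PENDING", "-o", "%P"],  # one partition name per pending job
    capture_output=True, text=True, check=True).stdout

pending = Counter(out.split())
now = datetime.now(timezone.utc).isoformat(timespec="seconds")

with open("/var/log/slurm/queue_depth.log", "a") as log:  # assumed path
    for partition, count in sorted(pending.items()):
        log.write(f"{now} {partition} {count}\n")

A crontab entry along the lines of "*/5 * * * * /usr/local/sbin/queue_depth.py" (path assumed) then yields a time series that can be averaged per partition and per interval afterwards.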