Ya, I'm kinda looking at exactly this right now as well.  For us, I know we're 
under-utilizing our hardware currently, but I still want to know if the number 
of pending jobs is growing because that would probably point to something going 
wrong somewhere.   It's a good metric to have.

We are going the route of using pyslurm/graphite/grafana to get our answers.  I 
know there is also a prometheus slurm data tool/grafana dashboards that might 
work just as well.

With pyslurm, I end up with an array of all current jobs and can then grab my 
metrics as needed.  We currently measure the "queue" time by comparing when the 
job was submitted vs. current time, as long as the job is Pending.  Once it's 
running, then the time spent in the queue is start time minus submit time.
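
As a rough sketch of that calculation (not our production code): the snippet below works on plain dicts so it runs anywhere. The field names `job_state`, `submit_time` and `start_time`, with times as epoch seconds, are assumptions modeled on the kind of per-job records pyslurm hands back; check them against your pyslurm version. The helper `queue_seconds` is a hypothetical name.

```python
from datetime import datetime, timezone


def queue_seconds(job, now=None):
    """Seconds a job has waited (if pending) or did wait (if started).

    Assumes epoch-second timestamps under the hypothetical keys
    'submit_time' and 'start_time', and a string 'job_state'.
    Returns None when the job never started and is no longer pending.
    """
    if now is None:
        now = int(datetime.now(timezone.utc).timestamp())
    submit = job["submit_time"]
    if job["job_state"] == "PENDING":
        # Still waiting: queue time so far is "now minus submit".
        return now - submit
    start = job.get("start_time")
    if start:
        # Started at some point: queue time is "start minus submit".
        return start - submit
    return None


# Made-up sample records, for illustration only.
jobs = [
    {"job_id": 101, "job_state": "PENDING", "submit_time": 1_000},
    {"job_id": 102, "job_state": "RUNNING", "submit_time": 1_000,
     "start_time": 1_600},
]
waits = {j["job_id"]: queue_seconds(j, now=4_600) for j in jobs}
```

Passing `now` explicitly keeps every job in one polling pass measured against the same instant.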

You could look at the job Reason to determine whether a job is pending on 
Resources, on QOS limits, etc.  I mostly only care about Resource-related 
pending, but we could also use the QOS/group CPU limit-related pending to show 
users that if they purchased more CPU time, they'd wait much less.
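
A minimal sketch of splitting pending jobs by Reason and formatting the counts for Graphite follows. The `state_reason` key and the `slurm.pending` metric prefix are assumptions for illustration, and the sample jobs are made up. The output lines follow Graphite's plaintext format of "path value timestamp"; actually sending them to a carbon listener is left out.

```python
from collections import Counter


def pending_by_reason(jobs):
    """Count pending jobs per Reason (assumed key: 'state_reason')."""
    return Counter(
        j.get("state_reason", "Unknown")
        for j in jobs
        if j["job_state"] == "PENDING"
    )


def graphite_lines(counts, timestamp, prefix="slurm.pending"):
    """Format counts as Graphite plaintext lines: '<path> <value> <ts>'."""
    return [
        f"{prefix}.{reason} {n} {timestamp}"
        for reason, n in sorted(counts.items())
    ]


# Made-up sample records, for illustration only.
jobs = [
    {"job_state": "PENDING", "state_reason": "Resources"},
    {"job_state": "PENDING", "state_reason": "Resources"},
    {"job_state": "PENDING", "state_reason": "QOSMaxCpuPerUserLimit"},
    {"job_state": "RUNNING", "state_reason": "None"},
]
counts = pending_by_reason(jobs)
lines = graphite_lines(counts, timestamp=1_700_000_000)
```

Graphing each Reason as its own series makes it easy to see whether a growing backlog is a capacity problem or a limits problem.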

Some of what I'm saying is hypothetical; we aren't actually graphing queue time 
yet, or at least not the way I want to.  But that is how I plan to go about it.

Rob

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Chip 
Seraphine <cseraph...@drwholdings.com>
Sent: Thursday, December 7, 2023 3:09 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Time spent in PENDING/Priority

Hi all,

I am trying to find some good metrics for our slurm cluster, and want them to 
reflect a factor that is very important to users: how long they had to wait 
because resources were unavailable.  This is a key metric for us because it is 
a decent approximation of how much life could be improved if we had more 
capacity, so it would be an important consideration when doing growth planning, 
setting user expectations, etc.  So we are specifically interested in how long 
jobs were in the PENDING state for reason Priority.

Unfortunately, I'm finding that this is difficult to pull out of squeue or the 
accounting data.  My first thought was that I could simply subtract SubmitTime 
from EligibleTime (or StartTime), but that includes time spent waiting for 
expected reasons, e.g. while an array chugs along.  The delta between 
StartTime and EligibleTime does not reflect the time spent PENDING at all, so 
it's not useful either.

I can grab some of my own metrics by polling squeue or the REST interface, I 
suppose, but those will be less accurate, more work, and will not allow me to 
see my past history.  I was wondering if there was something I was missing that 
someone on the list has figured out?   Perhaps some existing bit of accounting 
data that can tell me how long a job was stuck behind other jobs?

--

Chip Seraphine
Grid Operations
For support please use help-grid in email or slack.