Re: [slurm-users] Time spent in PENDING/Priority

2023-12-07 Thread Chip Seraphine
We use Prometheus as our primary metric tool, and I recently added a metric for jobs in PENDING for the specific reason of “priority”. So we’ll have some nice data for when we are preparing for FY 2025, I suppose. The problem is that for this past year we are stuck with what Slurm gathered…. unless
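A minimal sketch of the kind of metric described above, not the poster's actual exporter: it counts pending jobs per reason and renders them in the Prometheus text exposition format. It assumes the input is text shaped like the output of `squeue -h -t PD -o "%i|%r"` (job ID, then pending reason); the sample data and the metric name `slurm_pending_jobs` are hypothetical.

```python
from collections import Counter

# Hypothetical sample, shaped like `squeue -h -t PD -o "%i|%r"` output.
SAMPLE = """\
2001|Priority
2002|Resources
2003|Priority
2004|Dependency
"""


def pending_by_reason(listing):
    """Count pending jobs per reason from 'jobid|reason' lines."""
    counts = Counter()
    for line in listing.splitlines():
        _job_id, sep, reason = line.strip().partition("|")
        if not sep or not reason:
            continue  # skip blank or malformed lines
        counts[reason] += 1
    return counts


def to_exposition(counts, metric="slurm_pending_jobs"):
    """Render the counts as Prometheus text-format gauge samples.

    The metric name is an assumption made for this example.
    """
    lines = [f"# TYPE {metric} gauge"]
    for reason in sorted(counts):
        lines.append(f'{metric}{{reason="{reason}"}} {counts[reason]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_exposition(pending_by_reason(SAMPLE)), end="")
```

Sampling this at a fixed interval gives a time series of queue depth per reason, which is only available from the moment collection starts.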

Re: [slurm-users] Time spent in PENDING/Priority

2023-12-07 Thread Ryan Novosielski
I can’t quite answer the question, but I know that Open XDMoD does provide a field that gives this exact information, so they must have a formula they are using. They use exclusively the accounting database, AFAIK.

Re: [slurm-users] Time spent in PENDING/Priority

2023-12-07 Thread Groner, Rob
Ya, I'm kinda looking at exactly this right now as well. For us, I know we're under-utilizing our hardware currently, but I still want to know if the number of pending jobs is growing because that would probably point to something going wrong somewhere. It's a good metric to have. We are goi

[slurm-users] Time spent in PENDING/Priority

2023-12-07 Thread Chip Seraphine
Hi all, I am trying to find some good metrics for our slurm cluster, and want it to reflect a factor that is very important to users—how long did they have to wait because resources were unavailable. This is a very key metric for us because it is a decent approximation of how much life could b
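One rough way to approximate the wait described above from accounting data is to subtract each job's eligible time from its start time. The sketch below assumes pipe-delimited text shaped like the output of `sacct -X -n -P -o JobID,Submit,Eligible,Start`; the sample rows are made up. It measures total time between eligibility and start and does not distinguish why a job was pending.

```python
from datetime import datetime
from statistics import median

# Hypothetical sample, shaped like
# `sacct -X -n -P -o JobID,Submit,Eligible,Start` output.
SAMPLE = """\
1001|2023-12-07T10:00:00|2023-12-07T10:00:00|2023-12-07T10:05:00
1002|2023-12-07T10:01:00|2023-12-07T10:01:00|2023-12-07T11:01:00
1003|2023-12-07T10:02:00|2023-12-07T10:02:00|Unknown
"""


def parse_time(text):
    """Return a datetime, or None for placeholders such as 'Unknown'."""
    try:
        return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return None


def wait_seconds(report):
    """Map job ID to seconds between Eligible and Start.

    Jobs that have not started, or have no eligible time, are skipped.
    """
    waits = {}
    for line in report.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        job_id, _submit, eligible, start = fields
        eligible_at, started_at = parse_time(eligible), parse_time(start)
        if eligible_at is None or started_at is None:
            continue
        waits[job_id] = (started_at - eligible_at).total_seconds()
    return waits


if __name__ == "__main__":
    waits = wait_seconds(SAMPLE)
    for job_id, seconds in waits.items():
        print(f"{job_id}: waited {seconds:.0f} s")
    print(f"median wait: {median(waits.values()):.0f} s")
```

Using Eligible rather than Submit as the starting point leaves out time a job spent held or waiting on a dependency, which is usually not time lost to unavailable resources.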

[slurm-users] Reconfigure Gres for Node online?

2023-12-07 Thread Matthias Leopold
Hi, I want to change the Gres definition for a node from NodeName=s0-n10 Gres=gpu:a100:5 to NodeName=s0-n10 Gres=gpu:a100-sxm4-80gb:5. The hardware stays the same, only the Gres name changes; a100-sxm4-80gb is already defined in the cluster. When I do this online, will this affect running jobs on the node? Slur

Re: [slurm-users] Issues with orphaned jobs after update

2023-12-07 Thread Jeffrey McDonald
Hi, As an update, I was able to clear out the orphan/cancelled jobs by rebooting the compute nodes which had cancelled jobs. The error messages have ceased. Regards, Jeff On Wed, Dec 6, 2023 at 8:26 AM Jeffrey McDonald wrote: > Hi, > Yesterday, an upgrade to slurm from 22.05.4 to 23.11.0 went si

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-07 Thread Stefan Staeglich
Hi Xaver, we also had a similar problem with Slurm 21.08 (see thread "error: power_save module disabled, NULL SuspendProgram"). Fortunately, we have not yet observed this since the upgrade to 23.02. But the time period (about a month) is still too short to know if the problem is really fixed