We use Prometheus as our primary metrics tool, and I recently added a metric
for jobs sitting in PENDING with the specific reason "Priority". So we'll have
some nice data by the time we are preparing for FY 2025, I suppose; the problem
is that for this past year we are stuck with whatever Slurm gathered... unless
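(As a rough sketch of how a gauge like that can be produced, e.g. for the
node_exporter textfile collector; the metric name and output path below are
placeholders, not anything from the original setup:)

# Count jobs pending with reason "Priority"; %r is squeue's reason field.
count=$(squeue -h -t PENDING -o "%r" | grep -c '^Priority$')
# Drop it where a textfile collector will pick it up (path is a placeholder).
echo "slurm_pending_priority_jobs $count" > /var/lib/node_exporter/textfile_collector/slurm_pending.prom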
I can’t quite answer the question, but I know that Open XDMoD does provide a
field that gives this exact information, so they must have a formula they are
using. They use exclusively the accounting database, AFAIK.
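(For reference, wait time can be reconstructed from the accounting database as
Start minus Submit; a rough sacct sketch, with the date range as a placeholder:)

# Dump submit and start times for finished jobs; wait time = Start - Submit.
sacct -a -X -n -S 2024-01-01 -E 2024-12-31 -o JobID,Submit,Start --state=COMPLETED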
Yeah, I'm looking at exactly this right now as well. For us, I know we're
currently under-utilizing our hardware, but I still want to know whether the
number of pending jobs is growing, because that would probably point to
something going wrong somewhere. It's a good metric to have.
We are goi
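(A rough way to watch that trend is to log the pending-queue depth
periodically, e.g.:)

# Total jobs currently pending; record this on a timer to see the trend.
squeue -h -t PD | wc -l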
Hi all,
I am trying to find some good metrics for our Slurm cluster, and I want them
to reflect a factor that is very important to users: how long did they have to
wait because resources were unavailable. This is a key metric for us because
it is a decent approximation of how much life could b
Hi,
I want to change the Gres definition for a node
from
NodeName=s0-n10 Gres=gpu:a100:5
to
NodeName=s0-n10 Gres=gpu:a100-sxm4-80gb:5
-> The hardware stays the same; only the Gres name changes, and
a100-sxm4-80gb is already defined in the cluster.
If I do this online, will it affect running jobs on the node?
Slur
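(Not an authoritative answer, but assuming the new type is already known to
the cluster, the mechanical steps would look roughly like this:)

# slurm.conf, changed on all nodes:
NodeName=s0-n10 Gres=gpu:a100-sxm4-80gb:5
# Re-read the configuration, then verify:
scontrol reconfigure
scontrol show node s0-n10 | grep -i gres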
Hi,
As an update, I was able to clear out the orphaned/cancelled jobs by rebooting
the compute nodes that had cancelled jobs. The error messages have ceased.
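(One way to drive such reboots through Slurm itself; the node list here is
purely illustrative:)

# Reboot the affected nodes as soon as their jobs finish, then return them to service.
scontrol reboot ASAP nextstate=RESUME s0-n[01-10]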
Regards,
Jeff
On Wed, Dec 6, 2023 at 8:26 AM Jeffrey McDonald wrote:
> Hi,
> Yesterday, an upgrade to slurm from 22.05.4 to 23.11.0 went si
Hi Xaver,
We also had a similar problem with Slurm 21.08 (see the thread "error:
power_save module disabled, NULL SuspendProgram").
Fortunately, we have not observed it since the upgrade to 23.02, but the time
period (about a month) is still too short to know whether the problem is
really fixed.
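(For context, that error is logged when no SuspendProgram is configured; the
relevant slurm.conf knobs look roughly like this, with the paths as
placeholders:)

# Power saving is enabled by pointing Slurm at suspend/resume scripts.
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
ResumeProgram=/usr/local/sbin/slurm_resume.sh
SuspendTime=600      # seconds idle before a node is suspended
SuspendTimeout=120
ResumeTimeout=300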