I use Telegraf (which also supports an "exporter" output format) to
capture cgroups v2 job data:
https://github.com/jose-d/telegraf-configs/tree/master/slurm-cgroupsv2
I had to rework it when moving from cgroups v1 to cgroups v2, as the
format and structure of the text files changed a bit.
cheers
jos
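For anyone who wants to see what reading these text files involves, here is a minimal Python sketch. It is not taken from the linked Telegraf config. It assumes the job's cgroup directory contains the standard cgroups v2 files `cpu.stat` (one "key value" pair per line) and `memory.current` (a single number). The function names and the directory argument are hypothetical; the actual path of a job's cgroup depends on how Slurm is set up on the node.

```python
from pathlib import Path


def parse_flat_keyed(text: str) -> dict[str, int]:
    """Parse a cgroups v2 "flat keyed" file (e.g. cpu.stat) into a dict."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        # Each valid line has exactly one key and one integer value.
        if len(parts) == 2:
            values[parts[0]] = int(parts[1])
    return values


def read_job_cgroup(cgroup_dir: str) -> dict[str, int]:
    """Collect CPU and memory counters from one cgroup directory.

    cgroup_dir is supplied by the caller; this sketch does not guess
    where Slurm places job cgroups.
    """
    base = Path(cgroup_dir)
    metrics = parse_flat_keyed((base / "cpu.stat").read_text())
    # memory.current holds a single value: bytes currently in use.
    metrics["memory_current"] = int((base / "memory.current").read_text().strip())
    return metrics
```

Under cgroups v1 the same counters lived in differently named files under separate controller hierarchies, which is why a collector written for v1 needs reworking.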
Dear all,
I am having trouble finalizing the configuration of the backup
controller for my Slurm cluster.
In principle, if no job is running, everything seems fine: both
slurmctld services, on the primary and on the backup controller, are
running, and if I stop the service on the primary contro
Miriam,
You need to ensure the SlurmSaveState directory is the same for both.
And by 'the same', I mean all contents are exactly the same.
This is usually achieved by using a shared drive or replication.
Brian Andrus
On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote:
Dear all,
I am hav
Quick correction: the parameter is StateSaveLocation, not SlurmSaveState.
Brian Andrus
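A quick way to confirm that both controllers are configured with the same state directory is to compare the value in each host's slurm.conf. The sketch below is only an illustration: the slurm.conf parameter for this directory is spelled `StateSaveLocation`, the helper names are hypothetical, and the parser assumes the parameter sits on its own line. Parameter names are matched case-insensitively.

```python
def get_param(conf_text: str, key: str):
    """Return the value of `key` in slurm.conf-style text, or None."""
    for raw in conf_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip().lower() == key.lower():
            return value.strip()
    return None


def same_state_dir(primary_conf: str, backup_conf: str) -> bool:
    """True if both configs set StateSaveLocation to the same path."""
    primary = get_param(primary_conf, "StateSaveLocation")
    backup = get_param(backup_conf, "StateSaveLocation")
    return primary is not None and primary == backup
```

A matching path only shows that the configuration agrees. It does not prove that both hosts see the same files, so the shared mount still has to be checked on each controller.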
Hi Brian,
Thanks for replying.
In my first message I forgot to specify that the primary and the backup
controller have a shared filesystem mounted.
The StateSaveLocation points to a directory placed on the shared filesystem, so
both the primary and the backup controller are really reading/wr
I would hazard a guess that DNS is not working fully from or for
the nodes themselves.
Validate that you can ping the nodes by name from the backup controller.
Also verify that their hostnames match what DNS says they are. And validate
you can ping the backup controller from the nodes by the nam
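
The name checks above can be scripted. Here is a small Python sketch that tries to resolve a list of hostnames and reports the ones that fail. It only tests name resolution, not reachability, so ping is still needed afterwards. The function name and the node names in the usage note are hypothetical; `socket.gethostbyname` is the standard-library resolver call.

```python
import socket


def resolve_all(names, resolver=socket.gethostbyname):
    """Map each hostname to its resolved address, or None on failure."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError.
            results[name] = None
    return results


def unresolved(names, resolver=socket.gethostbyname):
    """Return the hostnames that could not be resolved."""
    return [n for n, addr in resolve_all(names, resolver).items() if addr is None]
```

Running `unresolved(["node01", "node02", "backup-ctl"])` on the backup controller and on a compute node, with your real hostnames substituted, should return an empty list on both if resolution works in each direction.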