I checked what you suggested: both controllers can communicate with all
the nodes without any problem.
Today I tested the takeover between the primary and the backup
controller several times and noticed that
the first scontrol takeover works perfectly: the backup controlle
I would hazard a guess that DNS is not working fully from or for
the nodes themselves.
Validate that you can ping the nodes by name from the backup controller.
Also verify that they are named what DNS says they are. And validate that you
can ping the backup controller from the nodes by the nam
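
As a rough illustration of the name checks suggested above, here is a minimal Python sketch. It only tests whether each hostname resolves to an address; it does not ping anything, so treat it as a first step rather than a full reachability test. The function names and the `resolve` parameter are hypothetical helpers introduced for this example, and the node names in the usage note are placeholders, not names from this cluster.

```python
import socket


def check_resolution(hostnames, resolve=socket.gethostbyname):
    """Map each hostname to its resolved IPv4 address, or None if lookup fails.

    `resolve` is a hypothetical injection point so the lookup can be
    replaced in tests; by default the system resolver is used.
    """
    results = {}
    for name in hostnames:
        try:
            results[name] = resolve(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError.
            results[name] = None
    return results


def unresolved(results):
    """Return the sorted list of hostnames that did not resolve."""
    return sorted(name for name, addr in results.items() if addr is None)
```

Running `unresolved(check_resolution(["node01", "node02"]))` on the backup controller, and the same call with the controller names on each node, would list any names the resolver cannot find from that host.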
Hi Brian,
Thanks for replying.
In my first message I forgot to specify that the primary and the backup
controller have a shared filesystem mounted.
StateSaveLocation points to a directory on the shared filesystem, so
both the primary and the backup controller are really reading/wr
Quick correction: it is StateSaveLocation, not SlurmSaveState.
Brian Andrus
On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote:
Dear all,
I am having trouble finalizing the configuration of the backup
controller for my slurm cluster.
In principle, if no job is running everything seems f
Miriam,
You need to ensure the SlurmSaveState directory is the same for both.
And by 'the same', I mean all contents are exactly the same.
This is usually achieved by using a shared drive or replication.
Brian Andrus
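
When the state directory is kept in sync by replication rather than a shared mount, one way to check the "exactly the same" requirement above is to hash every file in each copy and compare the results. The sketch below assumes both copies are readable from the machine running it and that the controller is not writing state files during the comparison; the function names are hypothetical, and the directory paths would come from the StateSaveLocation setting in slurm.conf.

```python
import hashlib
import os


def snapshot(root):
    """Return {relative path: SHA-256 hex digest} for every file under root."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def differing_paths(first, second):
    """Return sorted relative paths that are missing or differ between snapshots."""
    all_paths = first.keys() | second.keys()
    return sorted(p for p in all_paths if first.get(p) != second.get(p))
```

An empty list from `differing_paths(snapshot(dir_a), snapshot(dir_b))` means both copies hold the same files with the same contents. With a true shared filesystem both controllers see a single directory, so this comparison is only relevant for replicated setups.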