[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-26 Thread Miriam Olmi via slurm-users
I checked what you suggested: both controllers can communicate with all the nodes without any problem. Today I tested the takeover between the primary and the backup controller several times, and I noticed that the first scontrol takeover works perfectly: the backup controlle

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
I would hazard a guess that DNS is not working fully from or for the nodes themselves. Validate that you can ping the nodes by name from the backup controller. Also verify they are named what the DNS says they are. And validate you can ping the backup controller from the nodes by the nam
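
The name checks suggested above can be partly scripted. The following is a minimal sketch, not part of the original thread: it only tests whether each hostname resolves to an address on the machine where it runs, so it complements rather than replaces an actual ping. The function name `check_names` and the injectable `resolve` parameter are illustrative choices.

```python
import socket


def check_names(names, resolve=socket.gethostbyname):
    """Map each hostname to its resolved IPv4 address, or None on failure."""
    results = {}
    for name in names:
        try:
            results[name] = resolve(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError.
            results[name] = None
    return results
```

Running `check_names` with the controller and node names on both controllers and on a compute node, then comparing the results, would show whether every machine sees the same name-to-address mapping. Any hostnames passed in are whatever the cluster actually uses; none are assumed here.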

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Miriam Olmi via slurm-users
Hi Brian, thanks for replying. In my first message I forgot to specify that the primary and the backup controller have a shared filesystem mounted. The StateSaveLocation points to a directory placed on the shared filesystem so both the primary and the backup controller are really reading/wr
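
To make the setup described above concrete, here is a small sketch (not from the thread) that pulls the controller hosts and the state directory out of slurm.conf text. In slurm.conf the state directory keyword is spelled `StateSaveLocation`, and controllers are listed with `SlurmctldHost`, the first entry being the primary and later ones backups. The hostnames and the path in `SAMPLE` are hypothetical, and the parser assumes one `key=value` per line, which is a simplification of the real file format.

```python
def controller_settings(conf_text):
    """Return the SlurmctldHost entries and StateSaveLocation from slurm.conf text."""
    hosts = []
    state_dir = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip().lower()  # slurm.conf keywords are case-insensitive
        value = value.strip()
        if key == "slurmctldhost":
            hosts.append(value)
        elif key == "statesavelocation":
            state_dir = value
    return hosts, state_dir


# Hypothetical configuration fragment; hostnames and path are illustrative only.
SAMPLE = """
SlurmctldHost=ctl-primary
SlurmctldHost=ctl-backup
StateSaveLocation=/shared/slurm/state   # must be visible to both controllers
"""

hosts, state_dir = controller_settings(SAMPLE)
print(hosts, state_dir)
```

Running this against the slurm.conf on each controller and comparing the output is one quick way to confirm that both machines name the same controllers and point at the same state directory. It does not verify that the directory contents are actually shared.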

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
Quick correction: it is StateSaveLocation, not SlurmSaveState. Brian Andrus On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote: Dear all, I am having trouble finalizing the configuration of the backup controller for my Slurm cluster. In principle, if no job is running everything seems f

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
Miriam, You need to ensure the SlurmSaveState directory is the same for both. And by 'the same', I mean all contents are exactly the same. This is usually achieved by using a shared drive or replication. Brian Andrus On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote: Dear all, I am hav