I checked what you suggested: both controllers can communicate with all the nodes without any problem.

Today I tested the takeover dynamics between the primary and the backup controller multiple times, and I noticed that the first scontrol takeover works perfectly: the backup controller becomes the controller in charge, and all sinfo and
squeue requests are satisfied.

Everything also works fine when the primary controller comes back online: the jobs continue to run, and the
output of the sinfo and squeue commands is correct.
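
For reference, each takeover test is roughly this sequence (scontrol takeover is run on the backup controller; scontrol ping simply reports which slurmctld is currently responding):

    scontrol takeover      # run on the backup controller to force the takeover
    scontrol ping          # report which slurmctld (primary/backup) is up and in charge
    sinfo                  # check that partition/node information is served
    squeue                 # check that job information is served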

The very strange behavior appears when I test the takeover mechanism a second consecutive time, again with scontrol takeover from the backup controller. In this case everything seems fine at the beginning according to the log file: all the information about resources and jobs is recovered and the backup controller declares itself the current primary controller, but then things go completely wrong. Here are the last few lines in the log file before the
errors start:

[2024-03-26T15:12:17.664] Running as primary controller
[2024-03-26T15:12:17.664] debug:  Heartbeat thread started, beating every 5 seconds.
[2024-03-26T15:12:17.712] debug:  No feds to retrieve from state
[2024-03-26T15:12:17.796] Resuming operation with already established listening sockets
[2024-03-26T15:12:17.797] error: slurm_accept_msg_conn poll: Bad address
[2024-03-26T15:12:17.797] error: slurm_accept_msg_conn poll: Bad address
[2024-03-26T15:12:17.797] error: slurm_accept_msg_conn poll: Bad address

From this moment on, that error message is the only thing written to the log file, continuously, and the controller is unreachable from all the nodes. The only way to recover the backup controller is to restart the service.
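
(By restarting the service I mean the usual slurmctld restart on the backup controller; assuming a systemd setup, something like:)

    systemctl restart slurmctld    # restart the backup slurmctld to recover it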

What might be the cause?


On 25/03/24 22:59, Brian Andrus via slurm-users wrote:

I would hazard a guess that DNS is not working fully from or for the nodes themselves.

Validate that you can ping the nodes by name from the backup controller. Also verify that they are named what DNS says they are, and validate that you can ping the backup controller from the nodes using the name it has in the slurm.conf file.
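
For example, something along these lines (node001 and ctl-backup are placeholders for a real compute node and the backup controller as it is named in slurm.conf):

    # from the backup controller: resolve and ping a node by name
    getent hosts node001
    ping -c 2 node001
    hostname    # should match what DNS reports for this machine

    # from a compute node: ping the backup controller by its slurm.conf name
    getent hosts ctl-backup
    ping -c 2 ctl-backup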

Also, a quick way to do the failover check is to run (from the backup controller): scontrol takeover

Brian Andrus

On 3/25/2024 1:39 PM, Miriam Olmi wrote:
Hi Brian,

Thanks for replying.

In my first message I forgot to specify that the primary and the backup controller have a shared filesystem mounted.

StateSaveLocation points to a directory on the shared filesystem, so the primary and the backup controller really are reading/writing the very same files.
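
For context, the relevant slurm.conf settings are roughly as below (hostnames and the path are placeholders, not the real values; this assumes the modern SlurmctldHost syntax):

    SlurmctldHost=ctl-primary
    SlurmctldHost=ctl-backup                 # second entry acts as the backup controller
    StateSaveLocation=/shared/slurm/state    # directory on the shared filesystem
    SlurmctldTimeout=10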

Any other ideas?

Thanks again,
Miriam


On 25 March 2024 at 19:23:23 CET, Brian Andrus via slurm-users <slurm-users@lists.schedmd.com> wrote:

    Quick correction: it is StateSaveLocation, not SlurmSaveState.

    Brian Andrus

    On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote:

        Dear all,

        I am having trouble finalizing the configuration of the backup
        controller for my slurm cluster.

        In principle, if no job is running everything seems fine: the
        slurmctld services on both the primary and the backup controller
        are running, and if I stop the service on the primary controller,
        after roughly 10s (SlurmctldTimeout = 10 sec) the backup
        controller takes over. Also, if I run the sinfo or squeue command
        during the 10s of inactivity, the shell stays pending but recovers
        perfectly once the backup controller has taken control, and the
        same happens when the primary controller comes back.

        Unfortunately, if I try the same test while a job is running,
        there are two different behaviors depending on the initial
        scenario.

        1st scenario: both the primary and the backup controller are fine.
        I launch a batch script and verify it is running with sinfo and
        squeue. While the script is still running I stop the service on
        the primary controller successfully, but at this point everything
        goes wrong: in the slurmctld service log on the backup controller
        I find the following errors:

        slurmctld: error: Invalid RPC received REQUEST_JOB_INFO while in standby mode
        slurmctld: error: Invalid RPC received REQUEST_PARTITION_INFO while in standby mode
        slurmctld: error: Invalid RPC received REQUEST_JOB_INFO while in standby mode
        slurmctld: error: Invalid RPC received REQUEST_PARTITION_INFO while in standby mode
        slurmctld: error: slurm_accept_msg_conn poll: Bad address
        slurmctld: error: slurm_accept_msg_conn poll: Bad address

        and the sinfo and squeue commands return "Unable to contact slurm
        controller (connect failure)".

        2nd scenario: the primary controller is stopped and I launch a
        batch job while the backup controller is the only one working.
        While the job is running, I restart the slurmctld service on the
        primary controller. In this case the primary controller takes over
        immediately: everything is smooth and safe, and the sinfo and
        squeue commands continue to work perfectly.

        What might be the problem?

        Many thanks in advance!
        Miriam
