[slurm-users] Re: controller backup slurmctld error while takeover

Miriam Olmi via slurm-users Mon, 25 Mar 2024 13:40:07 -0700

Hi Brian, 

Thanks for replying.


In my first message I forgot to specify that the primary and the backup 
controller have a shared filesystem mounted. 

The StateSaveLocation points to a directory on the shared filesystem, so 
both the primary and the backup controller read and write the very same 
files. 
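
For reference, a minimal Python sketch of how one could confirm that both controllers see the same takeover-related settings. It pulls the controller hosts, the state directory (spelled StateSaveLocation in slurm.conf) and SlurmctldTimeout out of the text of a slurm.conf. The host names ctl1/ctl2 and the path /shared/slurm/state are hypothetical placeholders, not values from this cluster, and the parser is a simplification that assumes one Key=Value pair per line.

```python
def takeover_settings(conf_text):
    """Extract the slurm.conf settings that matter for controller takeover."""
    found = {
        "slurmctldhost": [],        # first entry is the primary, later ones are backups
        "statesavelocation": None,  # must be reachable from every controller
        "slurmctldtimeout": None,   # seconds before a backup takes over
    }
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip().lower()            # slurm.conf keys are case-insensitive
        value = value.strip()
        if key not in found:
            continue
        if key == "slurmctldhost":
            found[key].append(value)
        else:
            found[key] = value
    return found


# Hypothetical sample; host names and path are placeholders.
SAMPLE_CONF = """
SlurmctldHost=ctl1
SlurmctldHost=ctl2
StateSaveLocation=/shared/slurm/state   # hypothetical shared path
SlurmctldTimeout=10
"""

settings = takeover_settings(SAMPLE_CONF)
print(settings)
```

Running this against the slurm.conf on each controller and comparing the two results would show whether both daemons were started with the same host order, state directory and timeout.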

Any other ideas? 

Thanks again, 
Miriam

On 25 March 2024 19:23:23 CET, Brian Andrus via slurm-users 
<slurm-users@lists.schedmd.com> wrote:
>Quick correction, it is StateSaveLocation not SlurmSaveState.
>
>Brian Andrus
>
>On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote:
>> Dear all,
>> 
>> I am having trouble finalizing the configuration of the backup controller 
>> for my Slurm cluster.
>> 
>> In principle, if no job is running everything seems fine: the slurmctld 
>> services on both the primary and the backup controller are running, and if 
>> I stop the service on the primary controller, the backup controller takes 
>> over after about 10 s (SlurmctldTimeout = 10 sec).
>> 
>> Also, if I run the sinfo or squeue command during those 10 s of inactivity, 
>> the shell hangs, but it recovers as soon as the backup controller has taken 
>> control. The same happens when the primary controller comes back.
>> 
>> 
>> Unfortunately, if I run the same test while a job is running, there are two 
>> different behaviors depending on the initial scenario.
>> 
>> 1st scenario:
>> Both the primary and the backup controller are fine. I launch a batch 
>> script and verify with sinfo and squeue that it is running. While the 
>> script is still running, I stop the service on the primary controller. The 
>> stop itself succeeds, but at this point everything goes wrong:
>> 
>> On the backup controller, the slurmctld service log shows the following 
>> errors:
>> 
>> slurmctld: error: Invalid RPC received REQUEST_JOB_INFO while in standby mode
>> slurmctld: error: Invalid RPC received REQUEST_PARTITION_INFO while in standby mode
>> slurmctld: error: Invalid RPC received REQUEST_JOB_INFO while in standby mode
>> slurmctld: error: Invalid RPC received REQUEST_PARTITION_INFO while in standby mode
>> slurmctld: error: slurm_accept_msg_conn poll: Bad address
>> slurmctld: error: slurm_accept_msg_conn poll: Bad address
>> 
>> and the sinfo and squeue commands fail with "Unable to contact slurm 
>> controller (connect failure)".
>> 
>> 2nd scenario:
>> The primary controller is stopped and I launch a batch job while the backup 
>> controller is the only one running. While the job is running, I restart the 
>> slurmctld service on the primary controller. In this case the primary 
>> controller takes over immediately: everything goes smoothly, and the sinfo 
>> and squeue commands keep working.
>> 
>> What might be the problem?
>> 
>> Many thanks in advance!
>> 
>> Miriam
>> 
>
>-- 
>slurm-users mailing list -- slurm-users@lists.schedmd.com
>To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

