Hi Brian, Thanks for replying.
In my first message I forgot to mention that the primary and the backup controller have a shared filesystem mounted. StateSaveLocation points to a directory on that shared filesystem, so both the primary and the backup controller really are reading and writing the very same files.

Any other ideas?

Thanks again,
Miriam

On 25 March 2024 19:23:23 CET, Brian Andrus via slurm-users <slurm-users@lists.schedmd.com> wrote:
>Quick correction, it is SaveStateLocation not SlurmSaveState.
>
>Brian Andrus
>
>On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote:
>> Dear all,
>>
>> I am having trouble finalizing the configuration of the backup controller
>> for my slurm cluster.
>>
>> In principle, if no job is running everything seems fine: both the
>> slurmctld services on the primary and the backup controller are running
>> and if I stop the service on the primary controller, after 10s more or
>> less (SlurmctldTimeout = 10 sec) the backup controller takes over.
>>
>> Also, if I run the sinfo or squeue command during the 10s of inactivity,
>> the shell stays pending but it recovers perfectly after the time needed by
>> the backup controller to take control, and it works the same when the
>> primary controller is back.
>>
>> Unfortunately, if I try to do the same test while a job is running there
>> are two different behaviors depending on the initial scenario.
>>
>> 1st scenario:
>> Both the primary and backup controller are fine. I launch a batch script
>> and I verify the script is running with sinfo and squeue. While the script
>> is still running I stop the service on the primary controller with success
>> but at this point everything gets crazy:
>>
>> on the backup controller in the slurmctld service log I find the following
>> errors:
>>
>> slurmctld: error: Invalid RPC received REQUEST_JOB_INFO while in standby mode
>> slurmctld: error: Invalid RPC received REQUEST_PARTITION_INFO while in standby mode
>> slurmctld: error: Invalid RPC received REQUEST_JOB_INFO while in standby mode
>> slurmctld: error: Invalid RPC received REQUEST_PARTITION_INFO while in standby mode
>> slurmctld: error: slurm_accept_msg_conn poll: Bad address
>> slurmctld: error: slurm_accept_msg_conn poll: Bad address
>>
>> and the commands sinfo and squeue are Unable to contact slurm controller
>> (connect failure).
>>
>> 2nd scenario:
>> the primary controller is stopped and I launch a batch job while the
>> backup controller is the only one working. While the job is running, I
>> restart the slurmctld service on the primary controller. In this case the
>> primary controller takes over immediately: everything is smooth and safe
>> and the sinfo and squeue commands continue to work perfectly.
>>
>> What might be the problem?
>>
>> Many thanks in advance!
>>
>> Miriam
>>
>
>--
>slurm-users mailing list -- slurm-users@lists.schedmd.com
>To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
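
For anyone comparing their own setup against the one described in this thread, here is a minimal Python sketch that pulls the failover-related settings out of a slurm.conf. It is only an illustration: the host names, addresses and the path in SAMPLE_CONF are hypothetical placeholders, not values from this cluster, and the helper names parse_slurm_conf and failover_summary are made up for the example. Only SlurmctldTimeout=10 comes from the message above. In slurm.conf the state directory parameter is spelled StateSaveLocation, and the first SlurmctldHost line names the primary while later ones name backups.

```python
def parse_slurm_conf(text):
    """Collect settings from slurm.conf text, keyed by lower-cased name.

    Only the first Key=Value pair on each line is split off, which is
    enough for the single-value settings inspected below.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings.setdefault(key.strip().lower(), []).append(value.strip())
    return settings


def failover_summary(text):
    """Summarise the settings that matter for a primary/backup slurmctld pair."""
    settings = parse_slurm_conf(text)
    # SlurmctldHost may carry an address in parentheses: name(address).
    hosts = [v.split("(", 1)[0] for v in settings.get("slurmctldhost", [])]
    state_dirs = settings.get("statesavelocation", [])
    timeouts = settings.get("slurmctldtimeout", [])
    return {
        "primary": hosts[0] if hosts else None,
        "backups": hosts[1:],
        # Assumption: if a setting is repeated, report the last value seen.
        "state_save_location": state_dirs[-1] if state_dirs else None,
        "slurmctld_timeout": int(timeouts[-1]) if timeouts else None,
    }


# Hypothetical excerpt; host names, addresses and the path are placeholders.
SAMPLE_CONF = """\
SlurmctldHost=ctl-primary(10.0.0.1)
SlurmctldHost=ctl-backup(10.0.0.2)
StateSaveLocation=/shared/slurm/state
SlurmctldTimeout=10
"""

if __name__ == "__main__":
    print(failover_summary(SAMPLE_CONF))
```

Running the same summary against the slurm.conf on each controller and comparing the results is a quick way to confirm that both daemons agree on the host order and on the state directory. It does not check that the directory is actually the same mount on both machines.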