Daniel,

One way to set up a true HA is to configure master-master SQL instances on both head nodes. Then have each slurmdbd point to the other SQL instance as the backup host.
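
A minimal sketch of the storage lines this implies, assuming the two head nodes are named headA and headB (hypothetical names) and each runs its own SQL instance. Only the storage-related slurmdbd.conf options are shown; everything else in the file is left out.

```conf
# slurmdbd.conf on headA (hypothetical hostname)
StorageHost=localhost        # local SQL instance
StorageBackupHost=headB      # the other head node's SQL instance

# slurmdbd.conf on headB: the same, but with StorageBackupHost=headA
```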

This is likely not necessary as all data going to slurmdbd is cached if slurmdbd is unavailable. In the real world, this generally gives ample time to recover without issue.
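
The cache on the slurmctld side is bounded. As a sketch, assuming the MaxDBDMsgs option in slurm.conf is available in your Slurm version, the limit can be raised; the value below is purely illustrative, not a recommendation.

```conf
# slurm.conf
# Upper bound on accounting messages slurmctld queues while slurmdbd is unreachable.
MaxDBDMsgs=50000             # illustrative value only
```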

Brian Andrus

On 2/20/2025 6:45 PM, hermes via slurm-users wrote:

Thank you for your insightful suggestions. Placing both slurmdbd and slurmctld on the same node is indeed a new structure that we hadn’t considered before, and it seems to provide a much clearer logic for deployment.

Regarding the usage of DbdBackupHost, I would like to confirm my understanding of how it works. Does it mean that the DbdBackupHost option is only consulted when the slurmdbd service detects that its local database (specified by StorageHost) is unavailable? And I guess in that case the first slurmdbd service would act as a proxy that forwards requests to the DbdBackupHost and returns the data from there to slurmctld?

*From:* Daniel Letai <d...@letai.org.il>
*Sent:* February 20, 2025 21:56
*To:* taleinterve...@sjtu.edu.cn
*Cc:* slurm-users@lists.schedmd.com
*Subject:* Re: [slurm-users] Re: how to set slurmdbd.conf if using two slurmdb node with HA database?

It's functionally the same with one difference - the configuration file is unmodified between nodes, allowing for simple deployment of nodes, and automation.

Regarding the backuphost - that depends on your setup. If you can ensure the slurmdbd service will stop if the local db replica is not healthy, you shouldn't need backuphost. Conversely, if there is no health check to ensure replica readiness, configure the backuphost. This will require using a different conf file for each node, unless setting up a more robust HA clustering scheme.

The other option is to separate the dbd from the db. Put the dbd on the ctld nodes (A, B) and let nodes C and D act only as DB master replicas (no dbd).

In slurm.conf on nodes A and B you will then have:

AccountingStorageHost = localhost

(without AccountingStorageBackupHost)

And in slurmdbd.conf you will have:

DbdHost = localhost

(without DbdBackupHost)

StorageHost = nodeC

StorageBackupHost = nodeD

This would mean identical slurm.conf and slurmdbd.conf on both nodes A,B, and no slurm conf files or processes on nodes C,D.
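
Put together, the files on nodes A and B might look like the sketch below. The AccountingStorageType and StorageType lines are assumptions taken from a typical slurmdbd-with-MySQL deployment and are not part of the settings listed above; all other required options are omitted.

```conf
# slurm.conf (identical on nodes A and B)
AccountingStorageType=accounting_storage/slurmdbd   # assumed, not from this thread
AccountingStorageHost=localhost

# slurmdbd.conf (identical on nodes A and B)
DbdHost=localhost
StorageType=accounting_storage/mysql                # assumed, not from this thread
StorageHost=nodeC
StorageBackupHost=nodeD
```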

This setup assumes that the entire stack (ctld+dbd) is either working or not, which is usually true, as either the node is functioning or it is not. If the ctld is working but the dbd is not, you will lose the connection to the DB. If the ctld is not working, the other ctld will take charge and use its local dbd, so that scenario is covered.

Adding AccountingStorageBackupHost pointing to the other node is of course possible, but will mean different slurm.conf files which slurm will complain about.
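
As a sketch of what that variant would involve, assuming the hostnames nodeA and nodeB for the two ctld nodes, the accounting lines would differ per node like this:

```conf
# slurm.conf on node A
AccountingStorageHost=localhost
AccountingStorageBackupHost=nodeB

# slurm.conf on node B
AccountingStorageHost=localhost
AccountingStorageBackupHost=nodeA
```

The two files are no longer identical, which is the source of the complaint mentioned above.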

It will mean that most of the time you will not load balance on the multi-master DB replicas. Whether that is a consideration or not is for you to decide.

On 20/02/2025 3:57, taleinterve...@sjtu.edu.cn wrote:

    Do you mean the second configuration scheme?

    I think configuring `dbdhost=localhost` is the same as configuring
    `DbdAddr=nodeC` and `DbdAddr=nodeD` on the two nodes respectively.

    The key point is whether we should set the DbdBackupHost option
    and how it works.

    *From:* Daniel Letai <d...@letai.org.il>
    *Sent:* February 19, 2025 18:21
    *To:* slurm-users@lists.schedmd.com
    *Subject:* [slurm-users] Re: how to set slurmdbd.conf if using two
    slurmdb node with HA database?

    I'm not sure it will work, didn't test it, but could you just do
    `dbdhost=localhost` to solve this?

    On 18/02/2025 11:59, hermes via slurm-users wrote:

        The deployment scenario is as follows:

        *nodeA*                            *nodeB*
        (slurmctld)                        (backup slurmctld)
            | \-------------------------------/ |
            | /                               \ |
        *nodeC*                            *nodeD*
        (slurmdbd)                         (backup slurmdbd)
        (mysql)   <--multi master replica-->  (mysql)

        Since the database is multi-master replicated, the slurmdbd
        should only talk to the mysql on its own node.

        In that case, how should we set up slurmdbd.conf? The conf
        file contains the options “DbdAddr”, “DbdHost” and “DbdBackupHost”.

        Should they be consistent between nodeA-2 and nodeB-2? Such as:

        DbdAddr = nodeC              | DbdAddr = nodeC
        DbdHost = nodeC              | DbdHost = nodeC
        DbdBackupHost = nodeD        | DbdBackupHost = nodeD
        StorageHost = nodeC          | StorageHost = nodeD

        Or maybe just use different conf files and not use
        “DbdBackupHost”, like:

        DbdAddr = nodeC              | DbdAddr = nodeD
        DbdHost = nodeC              | DbdHost = nodeD
        StorageHost = nodeC          | StorageHost = nodeD

        I’m quite confused about the usage of DbdAddr and DbdHost.
        What is the difference between them, and why does only DbdHost
        have a backup counterpart?

        Another confusing point is how DbdBackupHost works. I guess it
        is slurmctld that is responsible for selecting the available
        slurmdbd. Since slurm.conf already contains
        “AccountingStorageHost” and “AccountingStorageBackupHost”, why
        do we need to set a backup dbd again on the slurmdbd side?




-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
