Daniel,
One way to set up true HA is to configure master-master SQL instances
on both head nodes, then have each slurmdbd point to the other SQL
instance as its backup host.
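A minimal, untested sketch of that layout, assuming the head nodes are named nodeA and nodeB (as in the diagram quoted further down) and each runs its own MySQL/MariaDB instance. The hostnames are placeholders and credentials are omitted:

```
# slurmdbd.conf on nodeA (hostnames are assumptions)
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageBackupHost=nodeB

# slurmdbd.conf on nodeB
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageBackupHost=nodeA
```

Note that the two files differ in StorageBackupHost, so they cannot be deployed as a single identical file.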
This is likely not necessary, as slurmctld caches all data destined for
slurmdbd while slurmdbd is unavailable. In the real world, this
generally gives ample time to recover without issue.
Brian Andrus
On 2/20/2025 6:45 PM, hermes via slurm-users wrote:
Thank you for your insightful suggestions. Placing both slurmdbd and
slurmctld on the same node is indeed a new structure that we hadn’t
considered before, and it seems to provide a much clearer logic for
deployment.
Regarding the usage of DbdBackupHost, I would like to confirm my
understanding of how it works. Does it mean that the DbdBackupHost
option is only consulted when the slurmdbd service detects that its
local database (specified by StorageHost) is unavailable? And I guess
that in that case, the first slurmdbd service would act as a proxy that
forwards requests to the DbdBackupHost and returns the data from there
to slurmctld?
*From:* Daniel Letai <d...@letai.org.il>
*Sent:* February 20, 2025, 21:56
*To:* taleinterve...@sjtu.edu.cn
*Cc:* slurm-users@lists.schedmd.com
*Subject:* Re: [slurm-users] Re: how to set slurmdbd.conf if using two
slurmdb node with HA database?
It's functionally the same with one difference - the configuration
file is unmodified between nodes, allowing for simple deployment of
nodes, and automation.
Regarding the backuphost - that depends on your setup. If you can
ensure the slurmdbd service will stop if the local db replica is not
healthy, you shouldn't need backuphost. Conversely, if there is no
health check to ensure replica readiness, configure the backuphost.
This will require using a different conf file for each node, unless
setting up a more robust HA clustering scheme.
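One way to approximate "slurmdbd stops when the local DB is down" is a systemd drop-in that binds slurmdbd to the local database unit. This is only a sketch: the unit name mariadb.service and the drop-in path are assumptions. It reacts to the database service stopping or failing, not to replication lag or an unhealthy replica, which would still need a separate health check.

```
# /etc/systemd/system/slurmdbd.service.d/local-db.conf (hypothetical path)
[Unit]
# Start slurmdbd only after the local database unit, and stop it
# whenever that unit stops or fails.
BindsTo=mariadb.service
After=mariadb.service
```

After adding the file, run `systemctl daemon-reload` so systemd picks up the drop-in.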
The other option is to separate the dbd from the db. Put the dbd on
the ctld nodes (A,B) and let nodes C,D serve only as DB master replicas
(no dbd).
In slurm.conf on nodes A,B you will then have:
AccountingStorageHost = localhost
(without AccountingStorageBackupHost)
And in slurmdbd.conf you will have:
DbdHost = localhost
(without DbdBackupHost)
StorageHost = nodeC
StorageBackupHost = nodeD
This would mean identical slurm.conf and slurmdbd.conf on both nodes
A,B, and no slurm conf files or processes on nodes C,D.
This setup assumes that the entire stack (ctld+dbd) is either working
or not, which is usually true, as either the node is functioning or
not. If the ctld is working but the dbd is not, you will lose the
connection to the DB. If the ctld is not working, the other ctld will
take charge and use its local dbd, so that scenario is covered.
Adding AccountingStorageBackupHost pointing to the other node is of
course possible, but will mean different slurm.conf files which slurm
will complain about.
It will mean that most of the time you will not load balance on the
multi-master DB replicas. Whether that is a consideration or not is
for you to decide.
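For comparison, the AccountingStorageBackupHost variant mentioned above might look like the following untested sketch (hostnames nodeA and nodeB are taken from the original diagram). The two files differ only in the backup host, which is exactly what makes the slurm.conf files non-identical:

```
# slurm.conf on nodeA
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStorageBackupHost=nodeB

# slurm.conf on nodeB
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStorageBackupHost=nodeA
```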
On 20/02/2025 3:57, taleinterve...@sjtu.edu.cn wrote:
Do you mean the second configuration scheme?
I think configuring `dbdhost=localhost` is the same as configuring
`DbdAddr=nodeC` and `DbdAddr=nodeD` on the two nodes respectively.
The key point is whether we should set the DbdBackupHost option
and how it works.
*From:* Daniel Letai <d...@letai.org.il>
*Sent:* February 19, 2025, 18:21
*To:* slurm-users@lists.schedmd.com
*Subject:* [slurm-users] Re: how to set slurmdbd.conf if using two
slurmdb node with HA database?
I'm not sure it will work, didn't test it, but could you just do
`dbdhost=localhost` to solve this?
On 18/02/2025 11:59, hermes via slurm-users wrote:
The deployment scenario is as follows:
    nodeA                                 nodeB
 (slurmctld)                       (backup slurmctld)
      |   \-------------------------------/   |
      |   /                               \   |
    nodeC                                 nodeD
 (slurmdbd)                         (backup slurmdbd)
   (mysql)   <--multi master replica-->   (mysql)
Since the database is multi-master replicated, the slurmdbd
should only talk to the mysql on its own node.
In such a case, how should we set up slurmdbd.conf? The conf
file contains the options “DbdAddr”, “DbdHost” and “DbdBackupHost”.
Should they be consistent between nodeA-2 and nodeB-2? Such as:
DbdAddr = nodeC        |  DbdAddr = nodeC
DbdHost = nodeC        |  DbdHost = nodeC
DbdBackupHost = nodeD  |  DbdBackupHost = nodeD
StorageHost = nodeC    |  StorageHost = nodeD
Or maybe just use a different conf file on each node and not use
“DbdBackupHost”, like:
DbdAddr = nodeC        |  DbdAddr = nodeD
DbdHost = nodeC        |  DbdHost = nodeD
StorageHost = nodeC    |  StorageHost = nodeD
I’m quite confused about the usage of DbdAddr and DbdHost.
What is the difference between them, and why does only DbdHost have
a backup counterpart?
Another confusing point is how DbdBackupHost works. I guess it
is slurmctld that is responsible for selecting the available
slurmdbd. Since slurm.conf already contains
“AccountingStorageHost” and “AccountingStorageBackupHost”, why
do we need to set a backup dbd again on the slurmdbd side?