Looking at the code, it seems DbdBackupHost (in slurmdbd.conf)
is used to determine whether to run in standby mode.
https://github.com/SchedMD/slurm/blob/ea17bbffc381deae54e126b227d5290bf9525326/src/slurmdbd/slurmdbd.c#L296-L314
There are two backup-host configurations.
DbdBackupHost is used if the slurmdbd service is unavailable
(timeout); in that case slurmctld will try to connect to the
slurmdbd on another node.
StorageBackupHost, on the other hand, is the backup host for the
database itself, which is what you describe.
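To make the distinction concrete, a minimal sketch of the relevant
parameters (all host names here are made up for the example):

    # slurm.conf (read by slurmctld)
    AccountingStorageHost=dbd1          # primary slurmdbd node
    AccountingStorageBackupHost=dbd2    # tried when dbd1 times out

    # slurmdbd.conf (read by slurmdbd)
    DbdHost=dbd1
    DbdBackupHost=dbd2                  # node allowed to run the standby slurmdbd
    StorageHost=dbserver1               # primary MySQL/MariaDB server
    StorageBackupHost=dbserver2         # database fallback, a different layer entirely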
Agreed
And slurmdbd also caches updates if the DB is down, if I remember
correctly.
On 21/02/2025 7:09, Brian Andrus via slurm-users wrote:
Daniel,
One way to set up true HA is to configure master-master SQL
instances
It's functionally the same, with one difference: the
configuration file is unmodified between nodes, allowing for
simple deployment of nodes and automation.
Regarding the backup host - that depends on your setup. If you can
ensure the slurmdbd service will
I'm not sure it will work (I didn't test it), but could you just do
`dbdhost=localhost` to solve this?
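Untested as well, but roughly what I have in mind (host names are
placeholders):

    # identical slurmdbd.conf deployed to both dbd nodes
    DbdHost=localhost
    StorageHost=dbserver

    # slurm.conf still names both nodes, so slurmctld handles the failover
    AccountingStorageHost=dbd1
    AccountingStorageBackupHost=dbd2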
On 18/02/2025 11:59, hermes via slurm-users wrote:
The deployment scenario is as follows:
There are a couple of options here, not exactly convenient, but they
will get the job done:
1. Use a job array, with `-N 1 -w ` defined for each array task (see
the sketch after this list). You can do the same without an array,
using a for loop to submit separate sbatch jobs.
2. Use `scontrol reboot`. Set the reb
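For the loop variant, something along these lines (node names and
script are placeholders):

    # one single-node job per host
    for node in node01 node02 node03; do
        sbatch -N 1 -w "$node" job.sh
    done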
Actually this is not Slurm versioning strictly speaking; it is OpenAPI
versioning - the move from 0.0.38 to 0.0.39 also dropped this particular
endpoint.
You will notice that the same major Slurm version supports different API
versions.
On 28/08/2024 03:02:00, Chris Samuel via slurm-users wrote:
https://github.com/SchedMD/slurm/blob/ffae59d9df69aa42a090044b867be660be259620/src/plugins/openapi/v0.0.38/jobs.c#L136
but no longer in
https://github.com/SchedMD/slurm/blob/slurm-23.02/src/plugins/openapi/v0.0.39/jobs.c
which underwent major revision in the next OpenAPI version.
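As a rough illustration of how this surfaces in practice, each plugin
version is its own URL prefix on slurmrestd, so you can probe which
versions a given daemon still serves (assuming JWT auth and the default
port):

    curl -s -H "X-SLURM-USER-NAME: $USER" -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
         http://localhost:6820/slurm/v0.0.39/jobs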
I think the issue is more severe than you describe.
Slurm juggles the needs of many jobs. Just because there are some
resources available at the exact second a job starts doesn't mean
those resources are not pre-allocated for some future job waiting
for e
This is a known issue and is resolved in 24.05.2 by the patches
labeled "Always allocate pointers despite skipping parsing".
For example:
https://github.com/SchedMD/slurm/commit/5b07b6bda407431215606b93e57d0a9b7f4c9b53
The same patch also applies to 0.0.40 and 0.0
input) to Slurm as a simple string of sbatch flags, and just let Slurm
do its thing. It sounds simpler than forcing all other users of the
cluster to adhere to your particular needs without introducing
unnecessary complexity to the cluster.
Regards,
Bhaskar.
Regards,
--Dani_L.
In the scenario you provide, you don't need anything special.
You just have to configure a partition that is available only to
you, and to no other account on the cluster. This partition will
only include your hosts. All other partitions will not include any of your hosts.
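Something along these lines in slurm.conf (partition, node and account
names are invented for the example):

    # only your account may use the partition that contains your hosts
    PartitionName=private Nodes=mynode[01-04] AllowAccounts=myaccount
    # the general partition simply does not list those nodes
    PartitionName=general Nodes=node[001-100] Default=YES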
I'm not sure I understand why your app must decide the placement, rather
than tell Slurm about the requirements (this sounds suspiciously like
Not Invented Here syndrome), but Slurm does have the '-w' flag to
salloc, sbatch and srun.
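For example (node name is a placeholder):

    sbatch -N 1 -w node042 myjob.sh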
I just don't understand if you don't have an entire cluster
Does SACK replace MUNGE? As in, is MUNGE no longer required when building
Slurm or on compute nodes?
If so, can the Requires and BuildRequires for munge be made optional on
bcond_without_munge in the spec file?
Or is there a reason MUNGE must remain a hard requirement for Slurm?
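If it is possible, I would imagine something roughly like this in the
spec (a sketch, not the actual slurm.spec):

    # munge stays on by default, rpmbuild --without munge turns it off
    %bcond_without munge

    %if %{with munge}
    BuildRequires: munge-devel
    Requires: munge
    %endif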
Thanks,
--Dani_L.
There is a kubeflow offering that might be of interest:
https://www.dkube.io/post/mlops-on-hpc-slurm-with-kubeflow
I have not tried it myself, no idea how well it works.
Regards,
--Dani_L.
On 05/05/2024 0:05, Dan Healy via slurm-users wrote: