Hi all,

I am looking for a clean way to set up Slurm's native high-availability feature. I am managing a Slurm cluster with one control node (hosting both slurmctld and slurmdbd), one login node, and a few dozen compute nodes. I have a virtual machine that I want to set up as a backup control node.
The Slurm documentation says the following about the StateSaveLocation directory:
The directory used should be on a low-latency local disk to prevent file system delays from affecting Slurm performance. If using a backup host, the StateSaveLocation should reside on a file system shared by the two hosts. We do not recommend using NFS to make the directory accessible to both hosts, but do recommend a shared mount that is accessible to the two controllers and allows low-latency reads and writes to the disk. If a controller comes up without access to the state information, queued and running jobs will be cancelled. [1]
My question: How do I implement the shared file system for the StateSaveLocation?
I do not want to introduce a single point of failure by having a single node host the StateSaveLocation, nor do I want to put that directory on the cluster's NFS storage, since outages/downtime of the storage system will happen at some point and I do not want that to cause an outage of the Slurm controller.
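For context, the HA-related part of my slurm.conf would look roughly like the following. The hostnames and the mount point are placeholders; the shared mount backing StateSaveLocation is exactly the piece I am unsure how to provide:

```
# slurm.conf (excerpt) -- hostnames and paths are placeholders
SlurmctldHost=ctl-primary      # current control node
SlurmctldHost=ctl-backup       # backup controller (the new VM)

# Must reside on a file system shared by both controllers:
StateSaveLocation=/var/spool/slurm/state

# Seconds the backup waits for the primary before taking over:
SlurmctldTimeout=120
```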
Any help or ideas would be appreciated.

Best,
Pierre

[1] https://slurm.schedmd.com/quickstart_admin.html#Config

--
Pierre Abele, M.Sc.
HPC Administrator
Max-Planck-Institute for Evolutionary Anthropology
Department of Primate Behavior and Evolution
Deutscher Platz 6
04103 Leipzig
Room: U2.80
E-Mail: pierre_ab...@eva.mpg.de
Phone: +49 (0) 341 3550 245