On 10/24/22 09:57, Diego Zuccato wrote:
> On 24/10/2022 09:32, Ole Holm Nielsen wrote:
>> It is definitely a BAD idea to store Slurm StateSaveLocation on a slow
>> NFS directory! SchedMD recommends using local NVMe or SSD disks
>> because there will be many IOPS to this file system!
> IIUC it does have to be shared between controllers, right?
> Possibly use an NVMe-backed (or even better, NVDIMM-backed) NFS share. Or
> a replica-3 Gluster volume with NVDIMMs for the bricks, for the paranoid :)
IOPS is the key parameter! Local NVMe or SSD storage should beat any
networked storage. The original question refers to having StateSaveLocation
on a standard (slow) NFS share, AFAICT.
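To compare candidate directories, a rough check could look like the Python
sketch below. It times small synchronous writes, which is only a proxy and
an assumption about what matters here, not a reproduction of slurmctld's
actual I/O pattern. The function name fsync_write_rate and its defaults
are made up for illustration; a dedicated tool such as fio would give more
trustworthy numbers.

```python
import os
import tempfile
import time


def fsync_write_rate(directory, n_writes=200, size=4096):
    """Return the number of small fsync'd writes per second in `directory`.

    Hypothetical helper: a crude stand-in for an IOPS measurement.
    """
    payload = b"x" * size
    # Create a scratch file inside the directory under test.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(n_writes):
            os.pwrite(fd, payload, 0)  # overwrite the same block each time
            os.fsync(fd)               # force the write to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)                # always clean up the scratch file
    # Guard against a zero interval on very fast (e.g. RAM-backed) storage.
    return n_writes / max(elapsed, 1e-9)
```

Running fsync_write_rate() once against a local SSD directory and once
against the NFS mount (as a user allowed to write there) should make the
difference visible. Absolute numbers will vary with mount options and
hardware.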
I don't know how many people prefer running two slurmctld hosts (primary
and backup). I certainly don't. Slurm does have a configurable
SlurmctldTimeout parameter, so you can reboot the server quickly when
needed.
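For reference, the slurm.conf parameters involved would look roughly like
the fragment below. The hostnames ctl1 and ctl2 and the path are
placeholders, not taken from any real site, and the second SlurmctldHost
line only applies if you actually run a backup controller.

```
# slurm.conf fragment (hostnames and path are placeholders)
# Primary controller
SlurmctldHost=ctl1
# Backup controller; omit this line for a single-controller setup
SlurmctldHost=ctl2
# With a backup controller, this directory must be on storage both hosts can reach
StateSaveLocation=/var/spool/slurmctld
# Seconds the backup waits for the primary before assuming control (default 120)
SlurmctldTimeout=120
```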
It would be nice if people with experience in HA storage for slurmctld
could comment.
/Ole