On 08/25/2017 06:19 PM, Nicholas McCollum wrote:
I like your documentation but I would add a few things:
I highly recommend not having slurmctld start automatically upon
reboot. If for some reason the Slurm spool directory isn't available
(for example, because it is on a shared folder), it will cause all the
jobs to die across the
cluster. I always like to triple check to make sure that the directory
is available before starting the slurmctld.
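
A minimal sketch of such a pre-start check, assuming the directory in
question is the one named by StateSaveLocation in slurm.conf. The helper
names state_dir_from_conf and state_dir_ready, and the path
/etc/slurm/slurm.conf, are hypothetical and only for illustration; adjust
them to your site:

```shell
#!/bin/sh
# Hypothetical helper: print the StateSaveLocation value from a slurm.conf file.
# Assumes the key is spelled exactly "StateSaveLocation" and takes the last
# matching line if the key appears more than once.
state_dir_from_conf() {
    conf="$1"
    sed -n 's/^[[:space:]]*StateSaveLocation[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$conf" | tail -n 1
}

# Hypothetical helper: succeed only if the directory exists and is writable
# by the user running the check.
state_dir_ready() {
    dir="$1"
    [ -n "$dir" ] && [ -d "$dir" ] && [ -w "$dir" ]
}

# Example use (commented out so that sourcing this file starts nothing):
# dir=$(state_dir_from_conf /etc/slurm/slurm.conf)
# if state_dir_ready "$dir"; then
#     slurmctld
# else
#     echo "State directory '$dir' is not available; not starting slurmctld" >&2
# fi
```

This only tests that the directory is present and writable. It does not
verify that the shared filesystem underneath is actually mounted, so a
manual look is still worthwhile.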
I also find it helpful, especially in instances like this, to run the
daemon in foreground mode.
# slurmctld -Dvvvv
# slurmd -Dvvvv
This will print out any errors directly on the terminal and you can see
right away why the daemon has crashed or failed to start.
Thanks for your nice comments. I added a section about manual daemon
startup to cover the scenario you describe:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#manual-startup-of-services
It's difficult to foresee every kind of problem which may occur, but
it's good to have common scenarios in the documentation.
Our Slurm master server only has local storage, but I suppose that you
need shared remote storage for Slurm HA controllers?
/Ole