You can also check out
HealthCheckNodeState=CYCLE

From man slurm.conf:
"Rather than running the health check program on all nodes at the same
time, cycle through running on all compute nodes through the course of
the HealthCheckInterval. May be combined with the various node state
options."
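
For illustration, a minimal slurm.conf sketch of how that could look. The program path and the 600-second interval are taken from the original post; combining CYCLE with ANY is only an assumption about which node states you want checked, so adjust it to whatever your site already uses.

```
# Sketch only: path and interval come from the original post.
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=600
# ANY is an assumed choice of node state; CYCLE spreads the runs
# across the interval instead of starting them all at once.
HealthCheckNodeState=ANY,CYCLE
```

With this, the checks on different nodes should start at different points within each interval rather than all hitting the central resource at the same moment.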
--
Chee
when starting node health check in SLURM
Hi,
We use HealthCheckProgram = /usr/sbin/nhc in Slurm to check node health
every 600 seconds. However, some NHC checks point to the same central resource,
so starting these checks simultaneously may lead to false alarms of service
degradation.
Is it possible to set a random offset to when