Hello,

Our Slurm cluster manages 600+ nodes, and I tested setting
HealthCheckNodeState=CYCLE in slurm.conf. According to the slurm.conf man
page, setting this to CYCLE should cause Slurm to "cycle through running on
all compute nodes through the course of the HealthCheckInterval". So I set
"HealthCheckInterval = 600" and expected the health check times to be
evenly distributed across the 600-second period.

But the test showed that the earliest node was checked at about 14:19:35
and the latest at about 14:20:39, so one round of health checks was spread
over only about 64 seconds. The previous round ran from 14:08:10 to
14:09:26, about 76 seconds. It seems that HealthCheckInterval only controls
the time between two rounds, not the time span over which the checks of one
round are distributed.
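
To put numbers on this, here is a small Python sketch of the arithmetic. It
only uses the approximate timestamps above; the node count of exactly 600 is
an assumption (the cluster has 600+), and the helper name is just for
illustration.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Elapsed seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# (first node checked, last node checked) for the two observed rounds
prev_round = ("14:08:10", "14:09:26")
last_round = ("14:19:35", "14:20:39")

prev_spread = seconds_between(*prev_round)  # duration of the previous round
last_spread = seconds_between(*last_round)  # duration of the latest round
gap = seconds_between(prev_round[1], last_round[0])  # end of one round to start of next
start_to_start = seconds_between(prev_round[0], last_round[0])

# Assumption: exactly 600 nodes, checked evenly across HealthCheckInterval
interval = 600
nodes = 600
expected_spacing = interval / nodes     # seconds per node if evenly spread
observed_spacing = last_spread / nodes  # seconds per node actually observed

print(f"previous round spread: {prev_spread} s")
print(f"latest round spread:   {last_spread} s")
print(f"gap between rounds:    {gap} s")
print(f"start to start:        {start_to_start} s")
print(f"expected spacing:      {expected_spacing:.2f} s/node")
print(f"observed spacing:      {observed_spacing:.2f} s/node")
```

With these inputs the rounds last 76 s and 64 s, and about 609 s pass between
the end of one round and the start of the next. Under the 600-node assumption
I expected roughly 1 s between nodes, but observed closer to 0.1 s.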

So did I misunderstand the description in the man page? And is there any
way to control how the health checks within one round are spread across the
nodes?

Thanks.
