Hello,
Our Slurm cluster manages 600+ nodes, and I tested setting HealthCheckNodeState=CYCLE in slurm.conf. According to the slurm.conf manual, setting this to CYCLE should cause Slurm to "cycle through running on all compute nodes through the course of the HealthCheckInterval". So I set HealthCheckInterval=600 and expected the health check times to be evenly distributed across the 600-second period.

However, the test showed that the earliest node was checked at about 14:19:35 and the latest at about 14:20:39, so one round of health checks was spread over just over 60 seconds. The previous round ran from 14:08:10 to 14:09:26. It seems that HealthCheckInterval only controls the time between two rounds, not the time span over which the checks of a single round are distributed.

Did I misunderstand the description in the manual? And is there any way to control how the checks within one round are spaced across the nodes?

Thanks.
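
For reference, the arithmetic behind these numbers can be sketched in a few lines of Python. This is only an illustration: the helper name `spread_seconds` is hypothetical, the timestamps are the approximate ones quoted above, and the node count is rounded down to 600 as an assumption.

```python
from datetime import datetime

def spread_seconds(first: str, last: str) -> int:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(last, fmt) - datetime.strptime(first, fmt)
    return int(delta.total_seconds())

# Observed span of each round (first checked node to last checked node).
previous_round = spread_seconds("14:08:10", "14:09:26")
latest_round = spread_seconds("14:19:35", "14:20:39")

# Time between the start of the previous round and the start of the latest one.
round_gap = spread_seconds("14:08:10", "14:19:35")

# What an even distribution over the interval would look like.
health_check_interval = 600   # seconds, as set in slurm.conf
node_count = 600              # assumption: "600+" rounded down
expected_spacing = health_check_interval / node_count

print(previous_round, latest_round, round_gap, expected_spacing)
```

With these inputs, the two rounds span 76 and 64 seconds, the rounds start 685 seconds apart, and an even distribution would mean roughly one node checked per second over the whole interval.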