On Tuesday, 5 December 2017 5:57:59 AM AEDT Stradling, Alden Reid (ars9ac)
wrote:
> I have a number of nodes that have, after our transition to Centos 7.3/SLURM
> 17.02, begun to occasionally display a status of "Not responding".
I'd suggest checking in your slurmd and slurmctld logs to see if a
I've seen this happen when there are internode communication issues
that disrupt the tree that slurm uses to talk to the nodes and do
heartbeats. We have this happen occasionally in our environment, as we
have nodes in two geographically separate facilities and the
latency is substantia
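
To line up the log entries with the affected nodes, it can help to record which nodes are flagged at a given moment. Below is a rough sketch, not a tested site tool: it assumes the Slurm client commands are on the PATH and relies on the sinfo convention that a "*" after the node state marks a node that is not responding. The function names and the sample text are made up for illustration.

```python
import subprocess


def not_responding_nodes(sinfo_output):
    """Return node names whose state carries the '*' (not responding) suffix.

    Expects text shaped like the output of `sinfo -h -N -o "%N %T"`:
    one "<node> <state>" pair per line. With -N a node can be listed
    once per partition, so duplicates are dropped.
    """
    flagged = []
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        node, state = parts
        if state.endswith("*") and node not in flagged:
            flagged.append(node)
    return flagged


def query_sinfo():
    """Run sinfo; only works on a host with the Slurm client tools installed."""
    result = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%N %T"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Hypothetical sample output, invented for illustration only.
SAMPLE = """\
node001 idle
node002 idle*
node003 allocated
node004 down*
node002 idle*
"""

if __name__ == "__main__":
    print(not_responding_nodes(SAMPLE))
```

On a real cluster, swapping SAMPLE for query_sinfo() and running this from cron would give timestamps to compare against the slurmd and slurmctld logs.
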
I have a number of nodes that have, after our transition to Centos 7.3/SLURM
17.02, begun to occasionally display a status of "Not responding". The health
check we run on each node every 5 minutes detects nothing, and the nodes are
perfectly healthy once I set their state to "idle". The slurmd c