I've seen this happen when there are internode communication issues that disrupt the tree Slurm uses to talk to the nodes and do heartbeats. We see it occasionally in our environment because we have nodes at two geographically separate facilities and the latency is substantial, so the lag crossing back and forth can add up. I would check that all your nodes can talk to each other and to the master, and that your timeouts are set high enough.
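
For what it's worth, these are the slurm.conf knobs I'd look at first. The values below are only illustrative, not a recommendation for your site -- compare against what 'scontrol show config' reports for your cluster:

    SlurmdTimeout=600     # seconds the controller waits for slurmd before marking a node not responding/down (default 300)
    MessageTimeout=30     # round-trip RPC message timeout in seconds (default 10)
    TreeWidth=50          # fanout of the communication tree slurmd uses to forward messages

You can also sanity-check connectivity from a compute node with 'scontrol ping' (which contacts the controller) and confirm the local daemon is alive with 'scontrol show slurmd'. If the nodes really are healthy and it's just latency, raising MessageTimeout and SlurmdTimeout is usually enough.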

-Paul Edmon-


On 12/04/2017 01:57 PM, Stradling, Alden Reid (ars9ac) wrote:
I have a number of nodes that have, after our transition to CentOS 7.3/Slurm 17.02, begun to occasionally display a status of "Not responding". The health check we run on each node every 5 minutes detects nothing, and the nodes are perfectly healthy once I set their state to "idle". The slurmd continues uninterrupted, and the nodes get jobs immediately after going back online.

Has anyone on this list seen similar behavior? I have increased logging to 
debug/verbose, but have seen no errors worth noting.

Cheers,

Alden
