I've seen this happen when there are internode communication issues that disrupt the tree Slurm uses to talk to the nodes and do heartbeats. We see it occasionally in our environment because we have nodes at two geographically separate facilities and the latency is substantial, so the lag crossing back and forth can add up. I would check that all your nodes can talk to each other and to the master, and that your timeouts are set high enough.
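
For what it's worth, these are the slurm.conf knobs I'd look at first. The values below are only illustrative, not a recommendation for your site -- compare against what 'scontrol show config' reports for your cluster:

    SlurmdTimeout=600     # seconds the controller waits for slurmd before marking a node not responding/down (default 300)
    MessageTimeout=30     # round-trip RPC message timeout in seconds (default 10)
    TreeWidth=50          # fanout of the communication tree slurmd uses to forward messages

You can also sanity-check connectivity from a compute node with 'scontrol ping' (which contacts the controller) and confirm the local daemon is alive with 'scontrol show slurmd'. If the nodes really are healthy and it's just latency, raising MessageTimeout and SlurmdTimeout is usually enough.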

-Paul Edmon-


On 12/04/2017 01:57 PM, Stradling, Alden Reid (ars9ac) wrote:
I have a number of nodes that have, after our transition to CentOS 7.3/Slurm 17.02, begun to occasionally display a status of "Not responding". The health check we run on each node every 5 minutes detects nothing, and the nodes are perfectly healthy once I set their state to "idle". The slurmd continues uninterrupted, and the nodes get jobs immediately after going back online.

Has anyone on this list seen similar behavior? I have increased logging to 
debug/verbose, but have seen no errors worth noting.

Cheers,

Alden
