Since our transition to CentOS 7.3/SLURM 17.02, a number of our nodes have begun to occasionally display a status of "Not responding". The health check we run on each node every 5 minutes detects nothing wrong, and the nodes are perfectly healthy once I set their state to "idle". The slurmd daemon keeps running uninterrupted throughout, and the nodes receive jobs immediately after going back online.
Has anyone on this list seen similar behavior? I have increased logging to debug/verbose, but have seen no errors worth noting.

Cheers,
Alden