Lin Yiqun created HADOOP-12680:
----------------------------------

             Summary: Loss of zookeeper quorum lead all the namenode to be standby state
                 Key: HADOOP-12680
                 URL: https://issues.apache.org/jira/browse/HADOOP-12680
             Project: Hadoop Common
          Issue Type: Bug
          Components: ha
    Affects Versions: 2.7.1
            Reporter: Lin Yiqun

While upgrading my ZooKeeper cluster, I changed the IP addresses of the ZK nodes, and both NameNodes of my Hadoop cluster lost their connection to ZK. After I recovered the ZK cluster, both NameNodes had transitioned to standby state, so the cluster could not provide service. I think the cause may be related to the following:

{code}
/**
 * If the elector gets disconnected from Zookeeper and does not know about
 * the lock state, then it will notify the service via the enterNeutralMode
 * interface. The service may choose to ignore this or stop doing state
 * changing operations. Upon reconnection, the elector verifies the leader
 * status and calls back on the becomeActive and becomeStandby app
 * interfaces. <br/>
 * Zookeeper disconnects can happen due to network issues or loss of
 * Zookeeper quorum. Thus enterNeutralMode can be used to guard against
 * split-brain issues. In such situations it might be prudent to call
 * becomeStandby too. However, such state change operations might be
 * expensive and enterNeutralMode can help guard against doing that for
 * transient issues.
 */
void enterNeutralMode();
{code}

Maybe we should create a thread that monitors the state of the NameNodes and prevents all of them from being in standby state at the same time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
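
To make the monitoring idea in the description more concrete, here is a minimal sketch of such a watchdog. It is not existing Hadoop code: the class {{AllStandbyWatchdog}}, its {{State}} enum, the state probes and the {{onStuck}} callback are all hypothetical names introduced for illustration. It only detects that no NameNode has been active for a number of consecutive polls. What to do next is left to the callback, since forcing a node to become active without holding the ZooKeeper lock could reintroduce the split-brain risk that the javadoc above warns about.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these types are part of Hadoop. */
final class AllStandbyWatchdog {

    /** Simplified HA states, assumed for this example only. */
    enum State { ACTIVE, STANDBY }

    private final List<Supplier<State>> nameNodes; // one state probe per NameNode
    private final int threshold;                   // consecutive all-standby polls tolerated
    private final Runnable onStuck;                // e.g. raise an alert or request a new election
    private int consecutiveAllStandby = 0;

    AllStandbyWatchdog(List<Supplier<State>> nameNodes, int threshold, Runnable onStuck) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.nameNodes = nameNodes;
        this.threshold = threshold;
        this.onStuck = onStuck;
    }

    /**
     * Runs one poll over all probes.
     *
     * @return true if the onStuck callback fired during this poll
     */
    synchronized boolean pollOnce() {
        for (Supplier<State> probe : nameNodes) {
            if (probe.get() == State.ACTIVE) {
                // At least one active NameNode: healthy, so reset the counter.
                consecutiveAllStandby = 0;
                return false;
            }
        }
        consecutiveAllStandby++;
        // Fire exactly once when the threshold is reached, so a transient
        // ZooKeeper disconnect shorter than the threshold is ignored.
        if (consecutiveAllStandby == threshold) {
            onStuck.run();
            return true;
        }
        return false;
    }
}
```

The monitoring thread itself could be as simple as a {{java.util.concurrent.ScheduledExecutorService}} calling {{pollOnce()}} via {{scheduleWithFixedDelay}}. The threshold acts as a grace period, in the same spirit as {{enterNeutralMode}} avoiding expensive state changes for transient issues.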