[ https://issues.apache.org/jira/browse/HDFS-8221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brahma Reddy Battula resolved HDFS-8221.
----------------------------------------
    Resolution: Duplicate

> HDFS have two Standby NNs because ActiveStandbyElectorLock ephemeralOwner in
> ZK is different with the sessionId stored in ZKFC
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-8221
>                 URL: https://issues.apache.org/jira/browse/HDFS-8221
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>    Affects Versions: 2.4.1
>            Reporter: shenxingfeng
>
> Initially, NN1 is active and NN2 is standby. When NN1 becomes standby for
> some reason, NN2 takes over the active state immediately. But after becoming
> active, NN2 changes back to standby, and HDFS is left with two standby NNs
> forever.
> After checking the log, I found that NN2 became standby because its session
> ID differs from the ActiveStandbyElectorLock ephemeralOwner stored in the
> znode.
> The root cause is that when NN1 goes to standby, NN2 creates session A with
> ZK and becomes active. Ideally, NN2's session ID should match the
> ActiveStandbyElectorLock ephemeralOwner stored in the znode, but a network
> problem can cause the session ID of NN2's ZKFC to change.
> So I think that when NN2 becomes standby because of a session ID mismatch,
> it should release the lock in the znode so that failover can happen again.
> ActiveStandbyElector.processResult
> ==================
>     Code code = Code.get(rc);
>     if (isSuccess(code)) {
>       // the following owner check completes verification in case the lock
>       // znode creation was retried
>       if (stat.getEphemeralOwner() == zkClient.getSessionId()) {
>         // we own the lock znode. so we are the leader
>         if (!becomeActive()) {
>           reJoinElectionAfterFailureToBecomeActive();
>         }
>       } else {
>         // we dont own the lock znode. so we are a standby.
>         becomeStandby();
>       }
>       // the watch set by us will notify about changes
>       return;
>     }
> ActiveStandbyElectorLock content
> ==================
> [zk: 160.149.0.114:24002(CONNECTED) 1] get /hadoop-ha/hacluster/ActiveStandbyElectorLock
> 160-149-0-117
> cZxid = 0x2000a38d9
> ctime = Thu Apr 16 11:32:54 CST 2015
> mZxid = 0x2000a38d9
> mtime = Thu Apr 16 11:32:54 CST 2015
> pZxid = 0x2000a38d9
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x164cb2b3e4b36ae4
> dataLength = 38
> numChildren = 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
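
The owner check quoted in the report can be illustrated with a small standalone model. This is only a sketch of the comparison, not the Hadoop or ZooKeeper code: `LockOwnerCheck`, `Outcome`, and `decide` are hypothetical names, and the second session ID is made up to stand in for the ID the reporter says NN2's ZKFC ended up with after the network problem.

```java
// Simplified model of the ephemeralOwner check described in the report.
// Hypothetical names throughout; no ZooKeeper client is involved.
final class LockOwnerCheck {

    enum Outcome { BECOME_ACTIVE, BECOME_STANDBY }

    // Mirrors the comparison in the quoted snippet: the elector considers
    // itself the leader only if the lock znode's ephemeralOwner equals the
    // session ID it currently holds.
    static Outcome decide(long ephemeralOwner, long localSessionId) {
        return ephemeralOwner == localSessionId
                ? Outcome.BECOME_ACTIVE
                : Outcome.BECOME_STANDBY;
    }

    public static void main(String[] args) {
        // ephemeralOwner value taken from the znode dump in the report.
        long ephemeralOwner = 0x164cb2b3e4b36ae4L;
        // Session that created the lock ("session A" in the report).
        long sessionA = ephemeralOwner;
        // Assumed ID after the session changed; purely illustrative.
        long changedSession = 0x164cb2b3e4b36ae5L;

        System.out.println(decide(ephemeralOwner, sessionA));       // BECOME_ACTIVE
        System.out.println(decide(ephemeralOwner, changedSession)); // BECOME_STANDBY
    }
}
```

In this model the second call returns `BECOME_STANDBY` even though the lock was created by the same node, which is the situation the reporter describes: the lock znode stays in place under the old owner while the node that created it no longer recognizes it as its own.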