wupeng created HADOOP-16724: ------------------------------- Summary: namenode state frequent switches from active to standby, the ERROR log is IPC's epoch 187 is less than the last promised epoch 188 Key: HADOOP-16724 URL: https://issues.apache.org/jira/browse/HADOOP-16724 Project: Hadoop Common Issue Type: Bug Components: common Affects Versions: 2.7.3 Reporter: wupeng
Recently, namenodes in our cluster often have their states changed from active to standby. Below is the log: active namenode : Serving checkpoints at http://zk14:50070 2019-11-24 08:43:09,765 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(406)) - Took 17727ms to send a batch of 1 edits (208 bytes) to remote journal 52.82.32.47.2:8485 2019-11-24 08:43:10,006 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal 52.82.32.47.2:8485 failed to write txns 1942183681-1942183681. Will try to write to this JN again after the next log roll. org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 187 is less than the last promised epoch 188 at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:428) at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:456) at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:351) at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:152) at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158) at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554) at org.apache.hadoop.ipc.Client.call(Client.java:1498) at org.apache.hadoop.ipc.Client.call(Client.java:1398) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) at com.sun.proxy.$Proxy11.journal(Unknown Source) at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167) at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385) at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2019-11-24 08:43:19,695 INFO namenode.TransferFsImage (TransferFsImage.java:receiveFile(575)) - Combined time for fsimage download and fsync to all disks took 9.35s. The fsimage download took 9.35s at 156316.01 KB/s. Synchronous (fsync) write to disk of /hadoop/hdfs/namenode/current/fsimage.ckpt_0000000001942183423 took 0.00s. Synchronous (fsync) write to disk of /data10_4T/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000001942183423 took 0.00s. 2019-11-24 08:43:19,695 INFO namenode.TransferFsImage (TransferFsImage.java:handleUploadImageRequest(135)) - Downloaded file fsimage.ckpt_0000000001942183423 size 1496792686 bytes. 2019-11-24 08:43:19,807 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:getImageTxIdToRetain(203)) - Going to retain 2 images with txid >= 1941733640 master zkfc: 2019-11-24 08:43:04,352 INFO zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1279)) - Session establishment complete on server zk14/52.82.32.47.14:2181, sessionid = 0x269c073377c56e4, negotiated timeout = 20000 2019-11-24 08:43:06,557 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(868)) - Exception handling the winning of election org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hadoop-ha/ns/ActiveBreadCrumb at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270) at org.apache.hadoop.ha.ActiveStandbyElector$5.run(ActiveStandbyElector.java:1025) at org.apache.hadoop.ha.ActiveStandbyElector$5.run(ActiveStandbyElector.java:1022) at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1067) at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1059) at org.apache.hadoop.ha.ActiveStandbyElector.setDataWithRetries(ActiveStandbyElector.java:1022) at org.apache.hadoop.ha.ActiveStandbyElector.writeBreadCrumbNode(ActiveStandbyElector.java:891) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:861) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:534) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510) 2019-11-24 08:43:06,557 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:reJoinElection(723)) - Trying to re-establish ZK session 2019-11-24 08:43:06,557 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(835)) - Terminating ZK connection for elector id=324457684 appData=0a026e7312036e6e321a047a6b313420d43e28d33e cb=Elector callbacks for NameNode at zk14/52.82.32.47.14:8020 2019-11-24 08:43:06,605 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x269c073377c56e4 closed 2019-11-24 08:43:07,605 INFO zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=zk14:2181,zk2:2181,zk1:2181 sessionTimeout=20000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@538e9048 2019-11-24 08:43:07,606 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server zk14/52.82.32.47.14:2181. Will not attempt to authenticate using SASL (unknown error) 2019-11-24 08:43:07,607 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(864)) - Socket connection established to zk14/52.82.32.47.14:2181, initiating session 2019-11-24 08:43:07,631 INFO zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1279)) - Session establishment complete on server zk14/52.82.32.47.14:2181, sessionid = 0x269c073377c57c6, negotiated timeout = 20000 2019-11-24 08:43:07,631 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(600)) - Session connected. 2019-11-24 08:43:07,631 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from old client with sessionId 0x269c073377c56e4 2019-11-24 08:43:07,632 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from old client with sessionId 0x269c073377c56e4 2019-11-24 08:43:07,632 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from old client with sessionId 0x269c073377c56e4 follower zkfc: 2019-11-24 08:43:06,577 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeStandby(484)) - ZK Election indicated that NameNode at zk2/52.82.32.47.2:8020 should become standby 2019-11-24 08:43:06,579 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeStandby(489)) - Successfully transitioned NameNode at zk2/52.82.32.47.2:8020 to standby state 2019-11-24 08:43:06,613 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:fenceOldActive(936)) - Checking for any old active which needs to be fenced... 2019-11-24 08:43:06,613 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:fenceOldActive(957)) - Old node exists: 0a026e7312036e6e321a047a6b313420d43e28d33e 2019-11-24 08:43:06,613 INFO ha.ZKFailoverController (ZKFailoverController.java:doFence(513)) - Should fence: NameNode at zk14/52.82.32.47.14:8020 2019-11-24 08:43:07,159 INFO ha.ZKFailoverController (ZKFailoverController.java:doFence(519)) - Successfully transitioned NameNode at zk14/52.82.32.47.14:8020 to standby state without fencing 2019-11-24 08:43:07,159 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:writeBreadCrumbNode(883)) - Writing znode /hadoop-ha/ns/ActiveBreadCrumb to indicate that the local node is the most recent active... 2019-11-24 08:43:07,212 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(380)) - Trying to make NameNode at zk2/52.82.32.47.2:8020 active... 2019-11-24 08:43:11,918 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(387)) - Successfully transitioned NameNode at zk2/52.82.32.47.2:8020 to active state 2019-11-24 09:00:32,482 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(211)) - Transport-level exception trying to monitor health of NameNode at zk2/52.82.32.47.2:8020: java.io.IOException: Connection reset by peer Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "zk2/52.82.32.47.2"; destination host is: "zk2":8020; 2019-11-24 09:00:32,482 INFO ha.HealthMonitor (HealthMonitor.java:enterState(249)) - Entering state SERVICE_NOT_RESPONDING 2019-11-24 09:00:32,482 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(851)) - Local service NameNode at zk2/52.82.32.47.2:8020 entered state: SERVICE_NOT_RESPONDING 2019-11-24 09:00:32,483 WARN tools.DFSZKFailoverController (DFSZKFailoverController.java:getLocalNNThreadDump(244)) - Can't get local NN thread dump due to Connection refused (Connection refused) 2019-11-24 09:00:32,483 INFO ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(768)) - Quitting master election for NameNode at zk2/52.82.32.47.2:8020 and marking that fencing is necessary 2019-11-24 09:00:32,483 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(406)) - Yielding from election 2019-11-24 09:00:32,483 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(835)) - Terminating ZK connection for elector id=324457684 appData=0a026e7312036e6e311a037a6b3220d43e28d33e cb=Elector callbacks for NameNode at zk2/52.82.32.47.2:8020 2019-11-24 09:00:32,514 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x169c07336613aab closed 2019-11-24 09:00:32,581 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from old client with sessionId 0x169c07336613aab -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-dev-h...@hadoop.apache.org