wupeng created HADOOP-16724:
-------------------------------

             Summary: NameNode state frequently switches from active to standby; 
the ERROR log is "IPC's epoch 187 is less than the last promised epoch 188"
                 Key: HADOOP-16724
                 URL: https://issues.apache.org/jira/browse/HADOOP-16724
             Project: Hadoop Common
          Issue Type: Bug
          Components: common
    Affects Versions: 2.7.3
            Reporter: wupeng


Recently, the NameNodes in our cluster have frequently switched from the active 
to the standby state.
The relevant logs are below.

Active NameNode:
Serving checkpoints at http://zk14:50070
2019-11-24 08:43:09,765 WARN client.QuorumJournalManager 
(IPCLoggerChannel.java:call(406)) - Took 17727ms to send a batch of 1 edits 
(208 bytes) to remote journal 52.82.32.47.2:8485
2019-11-24 08:43:10,006 WARN client.QuorumJournalManager 
(IPCLoggerChannel.java:call(388)) - Remote journal 52.82.32.47.2:8485 failed to 
write txns 1942183681-1942183681. Will try to write to this JN again after the 
next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 187 is 
less than the last promised epoch 188
 at 
org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:428)
 at 
org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:456)
 at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:351)
 at 
org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:152)
 at 
org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
 at 
org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
 at org.apache.hadoop.ipc.Client.call(Client.java:1498)
 at org.apache.hadoop.ipc.Client.call(Client.java:1398)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
 at com.sun.proxy.$Proxy11.journal(Unknown Source)
 at 
org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
 at 
org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
 at 
org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
2019-11-24 08:43:19,695 INFO namenode.TransferFsImage 
(TransferFsImage.java:receiveFile(575)) - Combined time for fsimage download 
and fsync to all disks took 9.35s. The fsimage download took 9.35s at 156316.01 
KB/s. Synchronous (fsync) write to disk of 
/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000001942183423 took 0.00s. 
Synchronous (fsync) write to disk of 
/data10_4T/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000001942183423 took 
0.00s.
2019-11-24 08:43:19,695 INFO namenode.TransferFsImage 
(TransferFsImage.java:handleUploadImageRequest(135)) - Downloaded file 
fsimage.ckpt_0000000001942183423 size 1496792686 bytes.
2019-11-24 08:43:19,807 INFO namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(203)) - Going to retain 2 
images with txid >= 1941733640
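For context on the rejection above: the JournalNode fences writers by epoch. Each new QuorumJournalManager writer negotiates a higher epoch number, and once a JournalNode has promised that epoch to the new writer, it rejects journal() calls that still carry a lower one. Below is a minimal sketch of that comparison (illustrative plain Java, not the actual Journal.checkRequest source; the field name and the epoch values 187/188 are taken from the log for illustration):

```java
import java.io.IOException;

public class EpochCheckSketch {
    // Hypothetical JournalNode state: the highest epoch promised so far.
    public long lastPromisedEpoch = 188;

    // Reject any request from a writer whose epoch is older than the promise,
    // producing the same kind of message seen in the log above.
    public void checkRequest(long reqEpoch) throws IOException {
        if (reqEpoch < lastPromisedEpoch) {
            throw new IOException("IPC's epoch " + reqEpoch
                + " is less than the last promised epoch " + lastPromisedEpoch);
        }
    }

    public static void main(String[] args) {
        EpochCheckSketch jn = new EpochCheckSketch();
        try {
            jn.checkRequest(187);  // stale writer: rejected
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In other words, the stack trace indicates that another NameNode had already become the writer with epoch 188, so the old active's batch at epoch 187 was fenced out.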

Master ZKFC:
2019-11-24 08:43:04,352 INFO zookeeper.ClientCnxn 
(ClientCnxn.java:onConnected(1279)) - Session establishment complete on server 
zk14/52.82.32.47.14:2181, sessionid = 0x269c073377c56e4, negotiated timeout = 
20000
2019-11-24 08:43:06,557 WARN ha.ActiveStandbyElector 
(ActiveStandbyElector.java:becomeActive(868)) - Exception handling the winning 
of election
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
BadVersion for /hadoop-ha/ns/ActiveBreadCrumb
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
 at 
org.apache.hadoop.ha.ActiveStandbyElector$5.run(ActiveStandbyElector.java:1025)
 at 
org.apache.hadoop.ha.ActiveStandbyElector$5.run(ActiveStandbyElector.java:1022)
 at 
org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1067)
 at 
org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1059)
 at 
org.apache.hadoop.ha.ActiveStandbyElector.setDataWithRetries(ActiveStandbyElector.java:1022)
 at 
org.apache.hadoop.ha.ActiveStandbyElector.writeBreadCrumbNode(ActiveStandbyElector.java:891)
 at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:861)
 at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:534)
 at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
 at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
2019-11-24 08:43:06,557 INFO ha.ActiveStandbyElector 
(ActiveStandbyElector.java:reJoinElection(723)) - Trying to re-establish ZK 
session
2019-11-24 08:43:06,557 INFO ha.ActiveStandbyElector 
(ActiveStandbyElector.java:terminateConnection(835)) - Terminating ZK 
connection for elector id=324457684 
appData=0a026e7312036e6e321a047a6b313420d43e28d33e cb=Elector callbacks for 
NameNode at zk14/52.82.32.47.14:8020
2019-11-24 08:43:06,605 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - 
Session: 0x269c073377c56e4 closed
2019-11-24 08:43:07,605 INFO zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - 
Initiating client connection, connectString=zk14:2181,zk2:2181,zk1:2181 
sessionTimeout=20000 
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@538e9048
2019-11-24 08:43:07,606 INFO zookeeper.ClientCnxn 
(ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server 
zk14/52.82.32.47.14:2181. Will not attempt to authenticate using SASL (unknown 
error)
2019-11-24 08:43:07,607 INFO zookeeper.ClientCnxn 
(ClientCnxn.java:primeConnection(864)) - Socket connection established to 
zk14/52.82.32.47.14:2181, initiating session
2019-11-24 08:43:07,631 INFO zookeeper.ClientCnxn 
(ClientCnxn.java:onConnected(1279)) - Session establishment complete on server 
zk14/52.82.32.47.14:2181, sessionid = 0x269c073377c57c6, negotiated timeout = 
20000
2019-11-24 08:43:07,631 INFO ha.ActiveStandbyElector 
(ActiveStandbyElector.java:processWatchEvent(600)) - Session connected.
2019-11-24 08:43:07,631 WARN ha.ActiveStandbyElector 
(ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from 
old client with sessionId 0x269c073377c56e4
2019-11-24 08:43:07,632 WARN ha.ActiveStandbyElector 
(ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from 
old client with sessionId 0x269c073377c56e4
2019-11-24 08:43:07,632 WARN ha.ActiveStandbyElector 
(ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from 
old client with sessionId 0x269c073377c56e4
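The BadVersionException above comes from ZooKeeper's conditional setData(path, data, version): the caller supplies the znode version it last read, and the server rejects the write if the znode has been modified since, which is why the elector gives up the breadcrumb write and re-joins the election. A minimal sketch of that compare-and-set rule (illustrative plain Java, not the ZooKeeper client API):

```java
// Sketch of a znode's versioned write check. ZooKeeper surfaces a failed
// version match as KeeperException.BadVersionException.
public class ZnodeSketch {
    public int version = 0;
    public byte[] data = new byte[0];

    public void setData(byte[] newData, int expectedVersion) {
        // -1 means "write unconditionally"; otherwise the versions must match.
        if (expectedVersion != -1 && expectedVersion != version) {
            throw new IllegalStateException("BadVersion: caller expected "
                + expectedVersion + " but znode is at version " + version);
        }
        data = newData;
        version++;  // every successful write bumps the znode version
    }
}
```

Under this rule, a ZKFC that read /hadoop-ha/ns/ActiveBreadCrumb at an older version loses the write race once another ZKFC has updated the znode first.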

Follower ZKFC:
2019-11-24 08:43:06,577 INFO ha.ZKFailoverController 
(ZKFailoverController.java:becomeStandby(484)) - ZK Election indicated that 
NameNode at zk2/52.82.32.47.2:8020 should become standby
2019-11-24 08:43:06,579 INFO ha.ZKFailoverController 
(ZKFailoverController.java:becomeStandby(489)) - Successfully transitioned 
NameNode at zk2/52.82.32.47.2:8020 to standby state
2019-11-24 08:43:06,613 INFO ha.ActiveStandbyElector 
(ActiveStandbyElector.java:fenceOldActive(936)) - Checking for any old active 
which needs to be fenced...
2019-11-24 08:43:06,613 INFO ha.ActiveStandbyElector 
(ActiveStandbyElector.java:fenceOldActive(957)) - Old node exists: 
0a026e7312036e6e321a047a6b313420d43e28d33e
2019-11-24 08:43:06,613 INFO ha.ZKFailoverController 
(ZKFailoverController.java:doFence(513)) - Should fence: NameNode at 
zk14/52.82.32.47.14:8020
2019-11-24 08:43:07,159 INFO ha.ZKFailoverController 
(ZKFailoverController.java:doFence(519)) - Successfully transitioned NameNode 
at zk14/52.82.32.47.14:8020 to standby state without fencing
2019-11-24 08:43:07,159 INFO ha.ActiveStandbyElector 
(ActiveStandbyElector.java:writeBreadCrumbNode(883)) - Writing znode 
/hadoop-ha/ns/ActiveBreadCrumb to indicate that the local node is the most 
recent active...
2019-11-24 08:43:07,212 INFO ha.ZKFailoverController 
(ZKFailoverController.java:becomeActive(380)) - Trying to make NameNode at 
zk2/52.82.32.47.2:8020 active...
2019-11-24 08:43:11,918 INFO ha.ZKFailoverController 
(ZKFailoverController.java:becomeActive(387)) - Successfully transitioned 
NameNode at zk2/52.82.32.47.2:8020 to active state
2019-11-24 09:00:32,482 WARN ha.HealthMonitor 
(HealthMonitor.java:doHealthChecks(211)) - Transport-level exception trying to 
monitor health of NameNode at zk2/52.82.32.47.2:8020: java.io.IOException: 
Connection reset by peer Failed on local exception: java.io.IOException: 
Connection reset by peer; Host Details : local host is: "zk2/52.82.32.47.2"; 
destination host is: "zk2":8020;
2019-11-24 09:00:32,482 INFO ha.HealthMonitor 
(HealthMonitor.java:enterState(249)) - Entering state SERVICE_NOT_RESPONDING
2019-11-24 09:00:32,482 INFO ha.ZKFailoverController 
(ZKFailoverController.java:setLastHealthState(851)) - Local service NameNode at 
zk2/52.82.32.47.2:8020 entered state: SERVICE_NOT_RESPONDING
2019-11-24 09:00:32,483 WARN tools.DFSZKFailoverController 
(DFSZKFailoverController.java:getLocalNNThreadDump(244)) - Can't get local NN 
thread dump due to Connection refused (Connection refused)
2019-11-24 09:00:32,483 INFO ha.ZKFailoverController 
(ZKFailoverController.java:recheckElectability(768)) - Quitting master election 
for NameNode at zk2/52.82.32.47.2:8020 and marking that fencing is necessary
2019-11-24 09:00:32,483 INFO ha.ActiveStandbyElector 
(ActiveStandbyElector.java:quitElection(406)) - Yielding from election
2019-11-24 09:00:32,483 INFO ha.ActiveStandbyElector 
(ActiveStandbyElector.java:terminateConnection(835)) - Terminating ZK 
connection for elector id=324457684 
appData=0a026e7312036e6e311a037a6b3220d43e28d33e cb=Elector callbacks for 
NameNode at zk2/52.82.32.47.2:8020
2019-11-24 09:00:32,514 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - 
Session: 0x169c07336613aab closed
2019-11-24 09:00:32,581 WARN ha.ActiveStandbyElector 
(ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from 
old client with sessionId 0x169c07336613aab



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
