[ https://issues.apache.org/jira/browse/HDFS-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinayakumar B resolved HDFS-7451.
---------------------------------
    Resolution: Not a Problem

> Namenode HA failover happens very frequently from active to standby
> -------------------------------------------------------------------
>
>                 Key: HDFS-7451
>                 URL: https://issues.apache.org/jira/browse/HDFS-7451
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: LAXMAN KUMAR SAHOO
>            Assignee: LAXMAN KUMAR SAHOO
>
> We have two NameNodes with HA enabled. For the last couple of days we have
> observed that failover from active to standby happens very frequently.
> Below are the log details from the active NameNode at the time of a
> failover. Is there any fix to get rid of this issue?
>
> Namenode logs:
> {code}
> 2014-11-25 22:24:02,020 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from 10.2.16.214:40751: output error
> 2014-11-25 22:24:02,020 INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 8020 caught an exception
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
>     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2195)
>     at org.apache.hadoop.ipc.Server.access$2000(Server.java:110)
>     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:979)
>     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1045)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1798)
> 2014-11-25 22:24:10,631 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /sda/dfs/namenode/current/edits_inprogress_0000000001643676954 -> /sda/dfs/namenode/current/edits_0000000001643676954-0000000001643677390
> 2014-11-25 22:24:10,631 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Closing
> java.lang.Exception
>     at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.close(IPCLoggerChannel.java:182)
>     at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.close(AsyncLoggerSet.java:102)
>     at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.close(QuorumJournalManager.java:446)
>     at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.close(JournalSet.java:107)
>     at org.apache.hadoop.hdfs.server.namenode.JournalSet$4.apply(JournalSet.java:222)
>     at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:347)
>     at org.apache.hadoop.hdfs.server.namenode.JournalSet.close(JournalSet.java:219)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.close(FSEditLog.java:308)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopActiveServices(FSNamesystem.java:939)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.stopActiveServices(NameNode.java:1365)
>     at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.exitState(ActiveState.java:70)
>     at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:61)
>     at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.setState(ActiveState.java:52)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToStandby(NameNode.java:1278)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToStandby(NameNodeRpcServer.java:1046)
>     at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToStandby(HAServiceProtocolServerSideTranslatorPB.java:119)
>     at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3635)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
> 2014-11-25 22:24:10,632 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state
> 2014-11-25 22:24:10,633 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on active node at dc1-had03-m002.dc01.revsci.net/10.2.16.92:8020 every 120 seconds.
> 2014-11-25 22:24:10,634 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread...
> Checkpointing active NN at dc1-had03-m002.dc01.revsci.net:50070
> Serving checkpoints at dc1-had03-m001.dc01.revsci.net/10.2.16.91:50070
> {code}
>
> zkfc logs:
> {code}
> 2014-11-25 22:24:12,192 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x449b8ce9a110255, likely server has closed socket, closing socket connection and attempting reconnect
> 2014-11-25 22:24:12,293 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
> 2014-11-25 22:24:12,950 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server dc1-had03-zook06.dc01.revsci.net/10.2.16.205:2181. Will not attempt to authenticate using SASL (unknown error)
> 2014-11-25 22:24:12,951 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dc1-had03-zook06.dc01.revsci.net/10.2.16.205:2181, initiating session
> 2014-11-25 22:24:12,952 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x449b8ce9a110255 has expired, closing socket connection
> 2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session expired. Entering neutral mode and rejoining...
> 2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
> 2014-11-25 22:24:12,952 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=dc1-had03-zook01.dc01.revsci.net:2181,dc1-had03-zook02.dc01.revsci.net:2181,dc1-had03-zook03.dc01.revsci.net:2181,dc1-had03-zook04.dc01.revsci.net:2181,dc1-had03-zook05.dc01.revsci.net:2181,dc1-had03-zook06.dc01.revsci.net:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@7f042529
> 2014-11-25 22:24:12,954 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181. Will not attempt to authenticate using SASL (unknown error)
> 2014-11-25 22:24:12,954 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181, initiating session
> 2014-11-25 22:24:13,172 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181, sessionid = 0x149a7f9b6d60263, negotiated timeout = 5000
> 2014-11-25 22:24:13,173 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
> 2014-11-25 22:24:13,173 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x449b8ce9a110255
> 2014-11-25 22:24:13,173 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2014-11-25 22:24:13,389 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 should become standby
> 2014-11-25 22:24:13,391 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 to standby state
> {code}
> -laxman

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)