Stephen O'Donnell created HDFS-11215:
----------------------------------------

             Summary: Log Namenode thread dump on unexpected exits
                 Key: HDFS-11215
                 URL: https://issues.apache.org/jira/browse/HDFS-11215
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: namenode
            Reporter: Stephen O'Donnell


With HA namenodes, we fairly often see a namenode exit because a quorum of journals did not respond within the 20 second timeout, for example:

{code}
2016-11-29 01:43:22,969  WARN 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms 
(timeout=20000 ms) for a response for sendEdits. No responses yet.
2016-11-29 01:43:23,954  FATAL 
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for 
required journal (JournalAndStream(mgr=QJM to [10.x.x.x:8485, 10.x.x.x:8485, 
10.x.x.x:8485], stream=QuorumOutputStream starting at txid 762756576))
2016-11-29 01:43:23,954  FATAL 
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for 
required journal (JournalAndStream(mgr=QJM to [10.x.x.x:8485, 10.x.x.x:8485, 
10.x.x.x:8485], stream=QuorumOutputStream starting at txid 762756576))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at 
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at 
org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at 
org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at 
org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at 
org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
        at 
org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at 
org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at 
org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2687)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2559)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:592)
        at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:110)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:395)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
2016-11-29 01:43:23,955  WARN 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting 
QuorumOutputStream starting at txid 762756576
2016-11-29 01:43:23,987  INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1
2016-11-29 01:43:24,003  INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
SHUTDOWN_MSG: 
{code}

When this happens, the root cause is often a mystery. For example, was it a network issue, was the thread blocked waiting on a response from a KDC (because no kdc_timeout was set), was it a slow disk (on the Journal, or even log4j blocking on the Namenode), etc.?

I wonder if it would make sense to log a thread dump to the Namenode's log just 
before exiting, as this may give some clues about what caused the lack of 
response from the Journals.

There are at least two ways to do this:

1. In JournalSet.java, the code appears to log the fatal error and then call 
terminate; we could log a thread dump there just before calling terminate.

2. We could change the terminate method in ExitUtil.java to log a thread dump 
whenever the exit status is non-zero, as there may be other scenarios where 
this would be useful.

There is existing code to log a thread dump in ReflectionUtils, so it could be 
reused here.
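
To make the idea concrete, below is a rough sketch (not a proposed patch) of the kind of helper that option 2 could call from ExitUtil.terminate before exiting with a non-zero status; option 1 could call the same thing from JournalSet. It uses java.lang.management directly so the snippet stands alone, and the class and method names (ThreadDumpOnExit, logThreadDump) are made up for illustration; a real patch would presumably reuse the ReflectionUtils thread-dump code mentioned above.

{code}
// Hypothetical sketch only: names are illustrative, not from any patch.
// A real change would likely reuse the existing thread dump printing in
// org.apache.hadoop.util.ReflectionUtils rather than this standalone helper.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public final class ThreadDumpOnExit {
  private static final Log LOG = LogFactory.getLog(ThreadDumpOnExit.class);

  private ThreadDumpOnExit() {}

  /**
   * Write a dump of all live threads to the log. Intended to be called
   * just before terminating, e.g. from ExitUtil.terminate() when the
   * exit status is non-zero, or from JournalSet before it terminates
   * on a failed required-journal flush.
   */
  public static void logThreadDump(String reason) {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    // true/true requests locked monitors and ownable synchronizers,
    // which helps show what each blocked thread is waiting on.
    ThreadInfo[] infos = threads.dumpAllThreads(true, true);
    StringBuilder sb = new StringBuilder("Thread dump (").append(reason).append("):\n");
    for (ThreadInfo info : infos) {
      // Note: ThreadInfo.toString() truncates very deep stacks.
      sb.append(info.toString());
    }
    LOG.error(sb.toString());
  }
}
{code}

The dump would then land in the Namenode log right next to the FATAL and SHUTDOWN_MSG lines shown above, so whatever the RPC handlers and the edit log sync thread were blocked on at exit time becomes visible.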

I could have a go at a patch if anyone thinks this is useful.


