[ https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kihwal Lee resolved HDFS-3771.
------------------------------

    Resolution: Won't Fix

I'm resolving this as WONTFIX. It is not an issue in 2.0, and there is a workaround for 0.23 if this rare condition occurs.

> Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3771
>                 URL: https://issues.apache.org/jira/browse/HDFS-3771
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 0.23.3
>         Environment: QE, 20 node Federated cluster with 3 NNs and 15 DNs, using Kerberos based security
>            Reporter: patrick white
>            Priority: Critical
>
> Our 0.23.3 nightly HDFS regression suite recently encountered a particularly nasty
> issue that left the cluster's default Namenode unable to restart. This was on a
> 20 node Federated cluster with security. The cause appears to be that the NN was
> just starting to roll its edit log when a shutdown occurred. The shutdown was
> intentional, to restart the cluster as part of an automated test.
> The tests that were running do not appear to be the issue in themselves. The
> cluster was just wrapping up an adminReport subset, and this failure case has
> not reproduced so far, nor was it failing previously. It looks like a chance
> occurrence of sending the shutdown just as the edit log roll had begun.
> From the NN log, the following sequence is noted:
> 1. an InvalidateBlocks operation had completed
> 2. FSNamesystem: Roll Edit Log from [Secondary Namenode IPaddr]
> 3. FSEditLog: Ending log segment 23963
> 4. FSEditLog: Starting log segment at 23967
> 5. NameNode: SHUTDOWN_MSG
> => the NN shuts down and then is restarted...
> 6. FSImageTransactionalStorageInspector: Logs beginning at txid 23967 were are all in-progress
> 7. FSImageTransactionalStorageInspector: Marking log at /grid/[PATH]/edits_inprogress_0000000000000023967 as corrupt since it has no transactions in it.
> 8. NameNode: Exception in namenode join [main]java.lang.IllegalStateException: No non-corrupt logs for txid 23967
> => NN start attempts continue to cycle but cannot succeed, failing each time
> on the same exception due to the lack of non-corrupt edit logs
> If these observations are correct and the issue comes from the shutdown happening
> while the edit logs are rolling, does the NN have an equivalent to the conventional
> fs 'sync' blocking action that should be called, or is there perhaps a timing hole?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
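
For readers following the log sequence quoted above, here is a minimal sketch of the failure condition as the report describes it. It is not the Hadoop implementation: the class `EditLogRollSketch`, the `Segment` type, and `selectSegment` are hypothetical names made up for illustration, and the transaction count of the finalized segment is an assumption inferred from the txids in the log. The only behavior modeled is what the log states: an in-progress segment with no transactions is marked corrupt, and startup fails when no non-corrupt log exists for the needed txid.

```java
import java.util.ArrayList;
import java.util.List;

public class EditLogRollSketch {

    /** Hypothetical stand-in for one edit log segment on disk. */
    static final class Segment {
        final long firstTxId;
        final boolean inProgress;
        final int numTransactions;

        Segment(long firstTxId, boolean inProgress, int numTransactions) {
            this.firstTxId = firstTxId;
            this.inProgress = inProgress;
            this.numTransactions = numTransactions;
        }

        // Per the quoted log: an in-progress segment that holds
        // no transactions is marked as corrupt.
        boolean isCorrupt() {
            return inProgress && numTransactions == 0;
        }
    }

    /** Returns a usable segment starting at txid, or fails like the NN in the report. */
    static Segment selectSegment(List<Segment> segments, long txid) {
        for (Segment s : segments) {
            if (s.firstTxId == txid && !s.isCorrupt()) {
                return s;
            }
        }
        throw new IllegalStateException("No non-corrupt logs for txid " + txid);
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>();
        // Finalized segment; the count of 4 (txids 23963 to 23966) is an assumption.
        segments.add(new Segment(23963L, false, 4));
        // Segment opened by the roll just before SHUTDOWN_MSG; nothing was written to it.
        segments.add(new Segment(23967L, true, 0));

        try {
            selectSegment(segments, 23967L);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model every restart sees the same on-disk state, so each attempt hits the same exception, which matches the restart cycle described in the report.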