[jira] [Resolved] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

Kihwal Lee (JIRA) Tue, 02 Apr 2013 12:31:16 -0700

     [ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kihwal Lee resolved HDFS-3771.
------------------------------

    Resolution: Won't Fix

I'm resolving this as WONTFIX. It is not an issue in 2.0 and there is a 
workaround for 0.23 if this rare condition occurs.
                
> Namenode can't restart due to corrupt edit logs, timing issue with shutdown 
> and edit log rolling
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3771
>                 URL: https://issues.apache.org/jira/browse/HDFS-3771
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 0.23.3
>         Environment: QE, 20 node Federated cluster with 3 NNs and 15 DNs, 
> using Kerberos based security
>            Reporter: patrick white
>            Priority: Critical
>
> Our 0.23.3 nightly HDFS regression suite recently hit a particularly nasty 
> issue that left the cluster's default Namenode unable to restart. This was 
> on a 20 node Federated cluster with security. The cause appears to be that 
> the NN was just starting to roll its edit log when a shutdown occurred. The 
> shutdown was intentional, to restart the cluster as part of an automated 
> test.
> The tests that were running do not appear to be the issue in themselves. The 
> cluster was just wrapping up an adminReport subset, and this failure has not 
> reproduced so far, nor was it failing previously. It looks like a chance 
> occurrence of the shutdown arriving just as the edit log roll began.
> From the NN log, the following sequence is noted:
> 1. an InvalidateBlocks operation had completed
> 2. FSNamesystem: Roll Edit Log from [Secondary Namenode IPaddr]
> 3. FSEditLog: Ending log segment 23963
> 4. FSEditLog: Starting log segment at 23967
> 5. NameNode: SHUTDOWN_MSG
> => the NN shuts down and then is restarted...
> 6. FSImageTransactionalStorageInspector: Logs beginning at txid 23967 were 
> are all in-progress
> 7. FSImageTransactionalStorageInspector: Marking log at 
> /grid/[PATH]/edits_inprogress_0000000000000023967 as corrupt since it has no 
> transactions in it.
> 8. NameNode: Exception in namenode join 
> [main]java.lang.IllegalStateException: No non-corrupt logs for txid 23967
> => the NN keeps cycling through restart attempts, each one failing on the 
> same exception because there are no non-corrupt edit logs
> If these observations are correct and the issue comes from the shutdown 
> happening while the edit logs are rolling, does the NN have an equivalent of 
> the conventional fs 'sync' blocking action that should be called, or is 
> there perhaps a timing hole?
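The failure sequence in the report can be sketched as a small, self-contained
Java model. This is only an illustration of the logic described in the log
lines above, not the actual Hadoop code: the class EditLogRecoverySketch, the
Segment type, and the selectLog method are hypothetical names, and the
transaction count of the finalized segment assumes contiguous txids.

```java
import java.util.ArrayList;
import java.util.List;

public class EditLogRecoverySketch {

    /** Hypothetical stand-in for one edit log segment found on disk. */
    static final class Segment {
        final long firstTxId;
        final boolean inProgress;
        final int txCount;

        Segment(long firstTxId, boolean inProgress, int txCount) {
            this.firstTxId = firstTxId;
            this.inProgress = inProgress;
            this.txCount = txCount;
        }

        // Models the "marking log as corrupt" step of the report: an
        // in-progress segment with no transactions is treated as corrupt.
        boolean isCorrupt() {
            return inProgress && txCount == 0;
        }
    }

    /** Returns a usable segment starting at txId, or fails as in the report. */
    static Segment selectLog(List<Segment> segments, long txId) {
        for (Segment s : segments) {
            if (s.firstTxId == txId && !s.isCorrupt()) {
                return s;
            }
        }
        throw new IllegalStateException("No non-corrupt logs for txid " + txId);
    }

    public static void main(String[] args) {
        List<Segment> onDisk = new ArrayList<>();
        // Segment 23963 was ended before the roll (assumed to hold 4 txns).
        onDisk.add(new Segment(23963L, false, 4));
        // Segment 23967 was started, then the shutdown arrived before any
        // transaction was written to it.
        onDisk.add(new Segment(23967L, true, 0));

        try {
            selectLog(onDisk, 23967L);
        } catch (IllegalStateException e) {
            // Every restart attempt ends here, so the NN cannot come up.
            System.out.println(e.getMessage());
        }
    }
}
```

Running main prints "No non-corrupt logs for txid 23967". In this model the
only segment covering that txid is empty and still in progress, which is why
repeated restarts hit the same exception.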

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
