[jira] [Resolved] (HDFS-2742) HA: observed dataloss in replication stress test

Eli Collins (Resolved) (JIRA) Tue, 31 Jan 2012 21:19:43 -0800

     [ 
https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Eli Collins resolved HDFS-2742.
-------------------------------

      Resolution: Fixed
    Hadoop Flags: Reviewed

I've committed this.
                
> HA: observed dataloss in replication stress test
> ------------------------------------------------
>
>                 Key: HDFS-2742
>                 URL: https://issues.apache.org/jira/browse/HDFS-2742
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: data-node, ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>         Attachments: hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, 
> hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, 
> log-colorized.txt
>
>
> The replication stress test case failed over the weekend since one of the 
> replicas went missing. Still diagnosing the issue, but it seems like the 
> chain of events was something like:
> - a block report was generated on one of the nodes while the block was being 
> written - thus the block report listed the block as RBW
> - when the standby replayed this queued message, it was replayed after the 
> file was marked complete. Thus it marked this replica as corrupt
> - it asked the DN holding the corrupt replica to delete it. And, I think, 
> removed it from the block map at this time.
> - That DN then did another block report before receiving the deletion. This 
> caused it to be re-added to the block map, since it was "FINALIZED" now.
> - Replication was lowered on the file, and it counted the above replica as 
> non-corrupt, and asked for the other replicas to be deleted.
> - All replicas were lost.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (HDFS-2742) HA: observed dataloss in replication stress test

Reply via email to