[ https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eli Collins resolved HDFS-2742. ------------------------------- Resolution: Fixed Hadoop Flags: Reviewed I've committed this. > HA: observed dataloss in replication stress test > ------------------------------------------------ > > Key: HDFS-2742 > URL: https://issues.apache.org/jira/browse/HDFS-2742 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: data-node, ha, name-node > Affects Versions: HA branch (HDFS-1623) > Reporter: Todd Lipcon > Assignee: Todd Lipcon > Priority: Blocker > Attachments: hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, > hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, > log-colorized.txt > > > The replication stress test case failed over the weekend since one of the > replicas went missing. Still diagnosing the issue, but it seems like the > chain of events was something like: > - a block report was generated on one of the nodes while the block was being > written - thus the block report listed the block as RBW > - when the standby replayed this queued message, it was replayed after the > file was marked complete. Thus it marked this replica as corrupt > - it asked the DN holding the corrupt replica to delete it. And, I think, > removed it from the block map at this time. > - That DN then did another block report before receiving the deletion. This > caused it to be re-added to the block map, since it was "FINALIZED" now. > - Replication was lowered on the file, and it counted the above replica as > non-corrupt, and asked for the other replicas to be deleted. > - All replicas were lost. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira