On 12 Feb 2018, at 20:21, Ryan Blue <rb...@netflix.com> wrote:
> I wouldn't say we have a primary failure mode that we deal with. What we concluded was that all the schemes we came up with to avoid corruption couldn't cover all cases. For example, what about when memory holding a value is corrupted just before it is handed off to the writer? That's why we track down the source of the corruption, remove it from our clusters, and let Amazon know to remove the instance from the hardware pool. We also structure our ETL so we have some time to reprocess.

I see. I could then rule out memory/disk buffering of the blocks as a source of corruption, leaving only working-memory failures which somehow get past ECC, or bus errors of some form. Filed https://issues.apache.org/jira/browse/HADOOP-15224 to add this to the to-do list for Hadoop >= 3.2.
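
For reference, here is a minimal sketch of the kind of end-to-end check this would allow, assuming the AWS SDK v1 client and illustrative bucket/key/block parameters (it is not the S3A write path or the actual patch): digest the block while it is still in the writer's hands, then let S3 verify it on the PUT.

  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.model.ObjectMetadata;
  import com.amazonaws.services.s3.model.PutObjectRequest;

  import java.io.ByteArrayInputStream;
  import java.security.MessageDigest;
  import java.util.Base64;

  public class ChecksummedPut {
    // Hypothetical helper: upload one buffered block with a Content-MD5
    // computed from the bytes as the writer last saw them.
    public static void putBlock(AmazonS3 s3, String bucket, String key, byte[] block)
        throws Exception {
      byte[] md5 = MessageDigest.getInstance("MD5").digest(block);

      ObjectMetadata meta = new ObjectMetadata();
      meta.setContentLength(block.length);
      // S3 recomputes the MD5 of the received bytes and fails the PUT
      // (BadDigest) if it does not match this value.
      meta.setContentMD5(Base64.getEncoder().encodeToString(md5));

      s3.putObject(new PutObjectRequest(bucket, key,
          new ByteArrayInputStream(block), meta));
    }
  }

That catches anything that mangles the bytes after the digest is taken (buffering to heap or disk, the HTTP path, S3's storage of the object), but, as Ryan points out, not a value that was already corrupted in memory before the digest was computed.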