On 12 Feb 2018, at 20:21, Ryan Blue <rb...@netflix.com> wrote:

I wouldn't say we have a primary failure mode that we deal with. What we 
concluded was that all the schemes we came up with to avoid corruption couldn't 
cover all cases. For example, what about when memory holding a value is 
corrupted just before it is handed off to the writer?

That's why we track down the source of the corruption, remove it from our 
clusters, and let Amazon know to remove the instance from the hardware pool. We 
also structure our ETL so we have some time to reprocess.


I see.

I could remove memory/disk buffering of the blocks as a source of corruption, 
leaving only working-memory failures which somehow get past ECC, or bus errors 
of some form.

Filed https://issues.apache.org/jira/browse/HADOOP-15224 to add this to the 
to-do list, targeting Hadoop >= 3.2.
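
To illustrate the idea (a sketch of my own, not what HADOOP-15224 will actually 
ship; the helper class and method names below are hypothetical): digest the 
bytes as they are handed to the writer, then set Content-MD5 on the S3 PUT so 
S3 recomputes the digest server-side and rejects the request if the buffered 
copy was corrupted before upload. This uses the AWS SDK v1 types that S3A 
already wraps. Ryan's caveat still applies: corruption that hits the value 
before it ever reaches the writer is not caught.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.ObjectMetadata;
    import com.amazonaws.services.s3.model.PutObjectRequest;
    import java.io.ByteArrayInputStream;
    import java.security.MessageDigest;
    import java.util.Base64;

    // Hypothetical helper, not part of S3A: upload one buffered block with
    // an end-to-end MD5 check.
    public class ChecksummedBlockUpload {

      public static void upload(AmazonS3 s3, String bucket, String key,
                                byte[] block, int len) throws Exception {
        // Digest computed over the bytes the caller handed to the writer.
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        md5.update(block, 0, len);
        String contentMd5 = Base64.getEncoder().encodeToString(md5.digest());

        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(len);
        // S3 recomputes the MD5 server-side and fails the PUT (BadDigest)
        // if the uploaded bytes no longer match the digest computed here.
        meta.setContentMD5(contentMd5);

        s3.putObject(new PutObjectRequest(
            bucket, key, new ByteArrayInputStream(block, 0, len), meta));
      }
    }

The same check would need to cover the disk-buffered case too (digest while 
spilling to disk, verify on upload), but the structure is the same.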


