Joshua Caplan created HADOOP-9577:
-------------------------------------
Summary: Actual data loss using s3n
Key: HADOOP-9577
URL: https://issues.apache.org/jira/browse/HADOOP-9577
Project: Hadoop Common
Issue Type: Bug
Reporter: Joshua Caplan
Priority: Critical
The implementation of needsTaskCommit() assumes that the FileSystem used for
writing temporary outputs is consistent. That is not the case when using the
S3 native filesystem (s3n) in the US Standard region. In larger jobs it is
quite common for the exists() call to return false even though the task
attempt wrote output minutes earlier, which effectively cancels the commit
operation with no error. That's real-life data loss right there, folks.
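
To make the failure mode concrete, here is a minimal, self-contained sketch.
It does not use any Hadoop classes: LaggingStore, CommitSketch, and the path
name are hypothetical stand-ins, and the delay is modeled with a logical clock
rather than real S3 behavior. It only assumes what is described above, namely
that the commit decision rests on an exists() check of the temporary output.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an eventually consistent store: a path written at
// logical time t only becomes visible to exists() at t + visibilityDelay.
class LaggingStore {
    private final Map<String, Long> visibleAt = new HashMap<>();
    private final long visibilityDelay;
    private long now = 0;

    LaggingStore(long visibilityDelay) {
        this.visibilityDelay = visibilityDelay;
    }

    void write(String path) {
        visibleAt.put(path, now + visibilityDelay);
    }

    boolean exists(String path) {
        Long t = visibleAt.get(path);
        return t != null && now >= t;
    }

    void advance(long ticks) {
        now += ticks;
    }
}

class CommitSketch {
    // Same shape as the check described above: commit only if the task's
    // temporary output appears to exist.
    static boolean needsTaskCommit(LaggingStore fs, String workPath) {
        return fs.exists(workPath);
    }

    // Returns true if the output would be promoted, false if the commit is
    // skipped. Note that the skip path raises no error at all.
    static boolean runTask(LaggingStore fs, String workPath, long ticksBeforeCommit) {
        fs.write(workPath);
        fs.advance(ticksBeforeCommit);
        if (!needsTaskCommit(fs, workPath)) {
            return false; // output was written, but is never committed
        }
        return true;
    }
}
```

With a visibility delay longer than the time between the write and the commit
check, runTask reports a skipped commit even though the write succeeded. With
a delay of zero (a consistent store) the same task commits normally.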

The saddest part is that the Hadoop APIs do not seem to provide any legitimate
means for the various RecordWriters to communicate with the OutputCommitter.
In my projects I have created a static map of semaphores keyed by
TaskAttemptID, which all my custom RecordWriters have to be aware of. That's
pretty lame.
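
For illustration, one way such a workaround could look is sketched below. This
is an assumption about the general shape, not the actual code from my
projects: TaskOutputRegistry and its method names are hypothetical, a plain
String stands in for TaskAttemptID so the example compiles without Hadoop, and
it only works if the RecordWriter and the OutputCommitter run in the same JVM.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;

// Hypothetical in-process side channel between RecordWriters and the
// OutputCommitter, keyed by task attempt ID (a String stand-in here).
final class TaskOutputRegistry {
    private static final ConcurrentMap<String, Semaphore> WRITTEN =
            new ConcurrentHashMap<>();

    private TaskOutputRegistry() {}

    private static Semaphore forAttempt(String attemptId) {
        // Start with zero permits: no output recorded yet.
        return WRITTEN.computeIfAbsent(attemptId, k -> new Semaphore(0));
    }

    // Called by a custom RecordWriter once it has written output.
    static void markOutputWritten(String attemptId) {
        forAttempt(attemptId).release();
    }

    // Called by a custom OutputCommitter instead of asking the filesystem
    // whether the temporary output exists.
    static boolean needsTaskCommit(String attemptId) {
        return forAttempt(attemptId).availablePermits() > 0;
    }

    // Called after commit or abort so the static map does not grow forever.
    static void clear(String attemptId) {
        WRITTEN.remove(attemptId);
    }
}
```

The commit decision then depends on what the writer itself reported rather
than on an exists() call, at the cost of every custom RecordWriter having to
know about the registry.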