Re: Output Committers for S3

Steve Loughran Tue, 21 Feb 2017 05:52:30 -0800

On 20 Feb 2017, at 18:14, Matthew Schauer wrote:


I'm using Spark 1.5.2 and trying to append a data frame to a partitioned
Parquet directory in S3.  It is known that the default
`ParquetOutputCommitter` performs poorly in S3 because move is implemented
as copy/delete, but the `DirectParquetOutputCommitter` is not safe to use
for append operations in case of failure.  I'm not very familiar with the
intricacies of job/task committing/aborting, but I've written a rough
replacement output committer that seems to work.  It writes the results
directly to their final locations and uses the write UUID to determine which
files to remove in the case of a job/task abort.  It seems to be a workable
concept in the simple tests that I've tried.  However, I can't make Spark
use this alternate output committer because the changes in SPARK-8578
categorically prohibit any custom output committer from being used, even if
it's safe for appending.  I have two questions: 1) Does anyone more familiar
with output committing have any feedback on my proposed "safe" append
strategy, and 2) is there any way to circumvent the restriction on append
committers without editing and recompiling Spark?  Discussion of solutions
in Spark 2.1 is also welcome.
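
The append strategy described above can be sketched roughly as follows. This is only an illustration, not the committer from the original message: it uses a local directory as a stand-in for S3, and the class name `UuidTaggedWriter`, its methods, and the `part-NNNNN-<uuid>.parquet` naming scheme are all hypothetical. The point it shows is that files go straight to their final location, and an abort deletes only the files tagged with that write's UUID, so data from earlier appends is left alone.

```python
import tempfile
import uuid
from pathlib import Path


class UuidTaggedWriter:
    """Hypothetical sketch: write directly to the destination and tag
    every file name with a per-write UUID so an abort can find them."""

    def __init__(self, dest):
        self.dest = Path(dest)
        # One UUID per write job; assumed unique across concurrent appends.
        self.write_uuid = uuid.uuid4().hex

    def write_part(self, partition, part, data):
        # No temporary directory and no rename: the file lands in place.
        target_dir = self.dest / partition
        target_dir.mkdir(parents=True, exist_ok=True)
        path = target_dir / f"part-{part:05d}-{self.write_uuid}.parquet"
        path.write_bytes(data)
        return path

    def abort(self):
        # Remove only the files carrying this write's UUID.
        # The listing is materialized first so deletion does not
        # interfere with the directory walk.
        doomed = list(self.dest.rglob(f"*-{self.write_uuid}.parquet"))
        for path in doomed:
            path.unlink()
        return len(doomed)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        earlier = UuidTaggedWriter(tmp)
        earlier.write_part("date=2017-02-20", 0, b"committed data")

        failing = UuidTaggedWriter(tmp)
        failing.write_part("date=2017-02-20", 0, b"partial data")
        failing.write_part("date=2017-02-21", 1, b"partial data")
        print("removed on abort:", failing.abort())

        survivors = sorted(p.name for p in Path(tmp).rglob("*.parquet"))
        print("still present:", survivors)
```

The sketch leaves several things open: partially written files are visible to readers until the abort runs, nothing cleans up if the process dies before `abort()` is called, and the cleanup relies on the store's listing returning every file that was written.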




Matthew, as part of the S3Guard committer I'm writing in the Hadoop codebase 
(which requires a consistent object store, implemented natively or via a 
DynamoDB database), I'm modifying FileOutputFormat to take alternate committers 
underneath.

Algorithm:
https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3a_committer.md

Code:
https://github.com/steveloughran/hadoop/tree/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit

Modified FileOutputFormat (FOF):
https://github.com/steveloughran/hadoop/tree/s3guard/HADOOP-13786-committer/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output

Current status: I'm getting the low-level tests at the MR layer working. The 
Spark committer exists to the point of compiling, but is not yet tested. If you 
want to get involved, the JIRA is https://issues.apache.org/jira/browse/HADOOP-13786


