On 20 Feb 2017, at 18:14, Matthew Schauer <matthew.scha...@ibm.com> wrote:
> I'm using Spark 1.5.2 and trying to append a data frame to a partitioned Parquet directory in S3. It is known that the default `ParquetOutputCommitter` performs poorly in S3 because move is implemented as copy/delete, but the `DirectParquetOutputCommitter` is not safe to use for append operations in case of failure.
>
> I'm not very familiar with the intricacies of job/task committing/aborting, but I've written a rough replacement output committer. It writes the results directly to their final locations and uses the write UUID to determine which files to remove in the case of a job/task abort. It seems to be a workable concept in the simple tests that I've tried.
>
> However, I can't make Spark use this alternate output committer because the changes in SPARK-8578 categorically prohibit any custom output committer from being used, even if it's safe for appending.
>
> I have two questions: 1) Does anyone more familiar with output committing have any feedback on my proposed "safe" append strategy, and 2) is there any way to circumvent the restriction on append committers without editing and recompiling Spark? Discussion of solutions in Spark 2.1 is also welcome.

Matthew, as part of the S3Guard committer I'm doing in the Hadoop codebase (which requires a consistent object store, implemented natively or via a DynamoDB database), I'm modifying FileOutputFormat to take alternate committers underneath.

Algorithm: https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3a_committer.md
Code: https://github.com/steveloughran/hadoop/tree/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit
Modified FOF: https://github.com/steveloughran/hadoop/tree/s3guard/HADOOP-13786-committer/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output

Current status: getting the low-level tests at the MR layer working. The Spark committer exists to the point of compiling, but is not yet tested.

If you do want to get involved, the JIRA is: https://issues.apache.org/jira/browse/HADOOP-13786
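
For anyone following the thread, here is a rough sketch of the append strategy described in the question: write each file directly to its final location, tag it with the write UUID, and on abort delete only the files carrying that UUID. It is a toy model on a local directory, not Matthew's committer and not the Hadoop `OutputCommitter` API. The class `UuidTaggedCommitter`, its methods, the file-naming scheme, and the partition names are all made up for illustration.

```python
import tempfile
import uuid
from pathlib import Path


class UuidTaggedCommitter:
    """Hypothetical toy committer: write in place, clean up by write UUID."""

    def __init__(self, table_dir):
        self.table_dir = Path(table_dir)
        # One UUID per write job; every file the job creates carries it.
        self.write_uuid = uuid.uuid4().hex

    def write_part(self, partition, task_id, data):
        # Write straight to the final location: no temporary dir, no rename.
        part_dir = self.table_dir / partition
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / f"part-{task_id:05d}-{self.write_uuid}.parquet"
        path.write_bytes(data)
        return path

    def commit_job(self):
        # Nothing to move: the files are already where they belong.
        pass

    def abort_job(self):
        # Delete only files tagged with this job's UUID, so data appended
        # by earlier jobs in the same partitions is left alone.
        doomed = sorted(self.table_dir.rglob(f"*-{self.write_uuid}.parquet"))
        for path in doomed:
            path.unlink()
        return len(doomed)


def run_demo():
    """Append once successfully, then abort a second append."""
    with tempfile.TemporaryDirectory() as tmp:
        table = Path(tmp) / "events"

        first = UuidTaggedCommitter(table)
        first.write_part("date=2017-02-19", 0, b"existing rows")
        first.commit_job()

        failed = UuidTaggedCommitter(table)
        failed.write_part("date=2017-02-19", 0, b"partial rows")
        failed.write_part("date=2017-02-20", 1, b"partial rows")
        removed = failed.abort_job()

        survivors = sorted(p.name for p in table.rglob("*.parquet"))
        return survivors, first.write_uuid, removed


if __name__ == "__main__":
    print(run_demo())
```

The sketch leaves out the hard parts. Readers can see partially written files before an abort, the empty partition directory created by the failed job is left behind, and nothing here models S3 listing behaviour, which is why the S3Guard work above requires a consistent object store.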