Thanks for the repo, Ryan!  I had heard that Netflix had a committer that
used the local filesystem as a temporary store, but I wasn't able to find
it anywhere until now.  I implemented something similar that writes to
HDFS and then copies to S3, but it doesn't use the multipart upload API, so
I'm sure yours will be faster.  I think this is the best option until
S3Guard comes out.
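For anyone who hasn't used it, the multipart flow is roughly the sketch
below (AWS SDK for Java v1; the class and the bucket/key/localParts
parameters are placeholder names, not anything from Ryan's repo): parts are
uploaded without being visible, and the object only appears when the upload
is completed, which is what lets a committer defer publication to commit
time.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadResult;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class MultipartCommitSketch {

  // Upload locally buffered part files, then publish the object in one
  // shot.  Until completeMultipartUpload returns, nothing is visible at
  // the key, which is what makes this usable as a commit step.
  static void commitToS3(String bucket, String key, List<File> localParts) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    InitiateMultipartUploadResult init = s3.initiateMultipartUpload(
        new InitiateMultipartUploadRequest(bucket, key));

    List<PartETag> etags = new ArrayList<>();
    int partNumber = 1;
    for (File part : localParts) {
      // Every part except the last must be at least 5 MB.
      etags.add(s3.uploadPart(new UploadPartRequest()
          .withBucketName(bucket)
          .withKey(key)
          .withUploadId(init.getUploadId())
          .withPartNumber(partNumber++)
          .withFile(part)).getPartETag());
    }

    // On task/job failure you would call abortMultipartUpload instead,
    // and no output ever appears in the bucket.
    s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
        bucket, key, init.getUploadId(), etags));
  }
}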

As far as my UUID-tracking approach goes, I was under the impression that a
given task would write the same set of files on each attempt.  Thus, if the
task fails, either the whole job is aborted and the files are removed, or
the task is retried and the files are overwritten.  On the other hand, I can
see how having partially-written data visible to readers immediately could
cause problems, and that is a good reason to avoid my approach.
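To make that concrete, the naming scheme I had in mind is something like
the hypothetical helper below: the path depends only on the job's UUID and
the partition index, never on the attempt number, so every attempt of a
given task targets the same file.

import org.apache.hadoop.fs.Path;

public class UuidNamingSketch {
  // Hypothetical helper: because the attempt number is not part of the
  // name, a retried task attempt overwrites the output of the failed one
  // rather than leaving a duplicate behind.  The downside, as noted above,
  // is that readers can see the file while it is still being written.
  static Path taskOutputPath(Path outputDir, String jobUuid, int partition) {
    return new Path(outputDir,
        String.format("part-%05d-%s", partition, jobUuid));
  }
}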

Steve -- that design document was a very enlightening read.  I will be
interested in following and possibly contributing to S3Guard in the future.


