Re: Output Committers for S3

2017-02-22 Thread Matthew Schauer
Well, the issue I'm trying to solve is slow writing due to S3's implementation of move as copy/delete. It seems like your S3 committers and S3Guard both ameliorate that somewhat by parallelizing the copy. I assume there's no better way to solve this issue without sacrificing safety. Even if ther

Re: Output Committers for S3

2017-02-21 Thread Matthew Schauer
Thanks for the repo, Ryan! I had heard that Netflix had a committer that used the local filesystem as a temporary store, but I wasn't able to find that anywhere until now. I implemented something similar that writes to HDFS and then copies to S3, but it doesn't use the multipart upload API, so I'

Output Committers for S3

2017-02-20 Thread Matthew Schauer
I'm using Spark 1.5.2 and trying to append a data frame to a partitioned Parquet directory in S3. It is known that the default `ParquetOutputCommitter` performs poorly in S3 because move is implemented as copy/delete, but the `DirectParquetOutputCommitter` is not safe to use for append operations in