On 8 Apr 2017, at 10:36, Yash Sharma <yash...@gmail.com> wrote:
Very interesting. I will give it a try. Thanks for pointing this out.
Also, are you planning to contribute it to Spark, and could it be a good
default option for Spark S3 copies?
It's going into Hadoop core itself, HADOOP-137
Have you got any benchmarking that could show the improvements in the job?
Thanks,
Yash
On Sat, 8 Apr 2017 at 02:38, Ryan wrote:
Yash,
We (Netflix) built a committer that uses the S3 multipart upload API to
avoid the copy problem and still handle task failures. You can build and
use the code posted here:
https://github.com/rdblue/s3committer
You're probably interested in the S3PartitionedOutputCommitter.
rb
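For reference, a minimal sketch of wiring the S3PartitionedOutputCommitter
mentioned above into a Spark job could look like the following. The
com.netflix.bdp.s3 package name, the exact config keys, and the bucket path
are assumptions to verify against the s3committer README, not settings quoted
from this thread:

  // Sketch: route Spark SQL file output through the partitioned committer
  // from the s3committer project instead of the default FileOutputCommitter.
  // Package name, config keys, and the s3a path below are assumptions.
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("s3-committer-sketch")
    // Committer class used by Spark SQL file sources (check the README).
    .config("spark.sql.sources.outputCommitterClass",
      "com.netflix.bdp.s3.S3PartitionedOutputCommitter")
    // Committer class used for Parquet output (check the README).
    .config("spark.sql.parquet.output.committer.class",
      "com.netflix.bdp.s3.S3PartitionedOutputCommitter")
    .getOrCreate()

  // Task output is staged as S3 multipart uploads and completed at job commit,
  // so there is no rename/copy of a _temporary directory on S3.
  spark.range(100)
    .write
    .mode("overwrite")
    .parquet("s3a://example-bucket/tmp/committer-test/")  // hypothetical path

With a committer along these lines, job commit completes the pending multipart
uploads instead of renaming files, which is where the S3 copy cost normally
comes from.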
On Thu, Ap
Hi All,
This is another issue that I was facing with Spark-S3 operability and wanted
to ask the broader community if it is faced by anyone else.
I have a rather simple aggregation query with a basic transformation. The
output, however, has a lot of output partitions (20K partitions). The spark