Re: [Discuss][Spark staging dir] way to disable spark writing to _temporary

2017-04-08 Thread Steve Loughran
On 8 Apr 2017, at 10:36, Yash Sharma <yash...@gmail.com> wrote: Very interesting. I will give it a try. Thanks for pointing this out. Also, are you planning to contribute it to Spark, and could it be a good default option for Spark S3 copies? It's going into Hadoop core itself, HADOOP-137…
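
For reference, the committer work that eventually shipped in Hadoop core is enabled through configuration rather than code changes. A minimal sketch, assuming a later Hadoop 3.1+ deployment with the S3A committers and Spark's hadoop-cloud module on the classpath (these property and class names post-date this thread and are not what the truncated JIRA reference above resolves to in any particular release):

    import org.apache.spark.sql.SparkSession

    // Sketch only: assumes Hadoop 3.1+ S3A committers plus Spark's
    // hadoop-cloud module; not available at the time of this thread.
    val spark = SparkSession.builder()
      .appName("s3a-committer-example")
      // Choose the committer variant: "directory", "partitioned" or "magic".
      .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
      // Bind Spark's commit protocol to Hadoop's PathOutputCommitter API
      // so writes bypass the rename-from-_temporary path entirely.
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      .getOrCreate()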

Re: [Discuss][Spark staging dir] way to disable spark writing to _temporary

2017-04-08 Thread Yash Sharma
Very interesting. I will give it a try. Thanks for pointing this out. Also, are you planning to contribute it to Spark, and could it be a good default option for Spark S3 copies? Have you got any benchmarking that could show the improvements in the job? Thanks, Yash. On Sat, 8 Apr 2017 at 02:38, Ryan…

Re: [Discuss][Spark staging dir] way to disable spark writing to _temporary

2017-04-07 Thread Ryan Blue
Yash, We (Netflix) built a committer that uses the S3 multipart upload API to avoid the copy problem and still handle task failures. You can build and use the copy posted here: https://github.com/rdblue/s3committer You're probably interested in the S3PartitionedOutputCommitter. rb. On Thu, Ap…
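
For readers wanting to try this, the exact wiring depends on the Spark version and output format, so treat the following as a sketch. The class name comes from the repository above; the config key shown is Spark 2.x's hook for swapping the Parquet output committer, and the project README is authoritative on the supported setups:

    import org.apache.spark.sql.SparkSession

    // Sketch only: com.netflix.bdp.s3.S3PartitionedOutputCommitter is the
    // committer named above; spark.sql.parquet.output.committer.class must
    // point at an org.apache.hadoop.mapreduce.OutputCommitter subclass.
    val spark = SparkSession.builder()
      .appName("s3committer-example")
      .config("spark.sql.parquet.output.committer.class",
        "com.netflix.bdp.s3.S3PartitionedOutputCommitter")
      .getOrCreate()

    // Parquet writes now upload task output as S3 multipart uploads that
    // are completed on job commit, instead of renaming from _temporary.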

[Discuss][Spark staging dir] way to disable spark writing to _temporary

2017-04-06 Thread Yash Sharma
Hi All, This is another issue that I was facing with Spark-S3 interoperability, and I wanted to ask the broader community if it's faced by anyone else. I have a rather simple aggregation query with a basic transformation. The output, however, has a lot of output partitions (20K partitions). The Spark…
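
For context, a minimal sketch of the job shape described above (the input path, column names, and output path are hypothetical). Spark's default FileOutputCommitter writes task output under _temporary and renames it into place on commit; on S3 each "rename" is a copy plus delete per object, so with ~20K output partitions the commit phase dominates the runtime:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch of the job shape described above; paths and
    // column names are hypothetical.
    val spark = SparkSession.builder().appName("s3-aggregation").getOrCreate()

    val events = spark.read.parquet("s3a://bucket/input/events")

    // Simple aggregation with a basic transformation.
    val daily = events.groupBy("event_date", "event_type").count()

    // Each task writes under .../_temporary; the commit then "renames"
    // ~20K files into place, which on S3 is a copy+delete per object.
    daily.write
      .partitionBy("event_date")
      .parquet("s3a://bucket/output/daily_counts")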