Re: Can not control bucket files number if it was speficed

2016-09-19 Thread Qiang Li
> Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising f

Re: Can not control bucket files number if it was speficed

2016-09-17 Thread Qiang Li

Can not control bucket files number if it was speficed

2016-09-17 Thread Qiang Li
Hi, I use Spark to generate data, and then we use Hive/Pig/Presto/Spark to analyze it. I found that even when I use bucketBy and sortBy with a bucket number in Spark, the number of result files generated by Spark under each partition is always far more than the bucket number, so Presto can not recognize the b
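
A rough way to see why the file count can exceed the bucket number: as far as I understand Spark's bucketed writer, each write task emits its own file for every (partition, bucket) pair it happens to hold rows for, so the total can approach tasks x buckets. The plain-Python sketch below only simulates that counting. It is not Spark code; `count_output_files`, the integer keys, and the modulo bucketing are illustrative assumptions (Spark uses its own hash function), and whether Presto accepts the resulting layout is a separate question.

```python
from collections import defaultdict


def count_output_files(rows, num_tasks, num_buckets, assign_task):
    """Count files per partition, assuming one file per (bucket, task) pair.

    rows        -- iterable of (partition_value, integer_key) tuples
    assign_task -- hypothetical function (row_index, partition, bucket) -> task id
    """
    files = defaultdict(set)
    for i, (partition, key) in enumerate(rows):
        bucket = key % num_buckets  # stand-in for the real bucket hash
        task = assign_task(i, partition, bucket) % num_tasks
        files[partition].add((bucket, task))
    return {partition: len(pairs) for partition, pairs in files.items()}


# 1000 rows in a single partition, 4 buckets, 5 write tasks (all made-up numbers).
rows = [("2016-09-17", k) for k in range(1000)]

# Rows spread over tasks without regard to bucket: every task sees every bucket.
unaligned = count_output_files(rows, 5, 4, lambda i, p, b: i)

# Rows shuffled so that each bucket lands in exactly one task before the write.
aligned = count_output_files(rows, 5, 4, lambda i, p, b: b)

print(unaligned)  # {'2016-09-17': 20}
print(aligned)    # {'2016-09-17': 4}
```

In this model the file count only drops to the bucket number when the rows are first redistributed so that all rows of a bucket sit in one task.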

Re: Spark output data to S3 is very slow

2016-09-17 Thread Qiang Li
ive.com/user@spark.apache.org/msg56791.html
>
> // maropu
>
> On Sat, Sep 17, 2016 at 11:34 AM, Qiang Li wrote:
>
>> Hi,
>>
>> I ran some jobs with Spark 2.0 on Yarn, I found all tasks finished very
>> quickly, but the last step, spark spend lots of time t

Spark output data to S3 is very slow

2016-09-16 Thread Qiang Li
Hi, I ran some jobs with Spark 2.0 on YARN. All tasks finished very quickly, but in the last step Spark spends a lot of time renaming or moving data from the S3 temporary directory to the final directory. I then tried to set spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.exec