Re: Support for skewed joins in Spark

2015-05-04 Thread ๏̯͡๏
Hello Soila,

Can you share the code that shows usage of RangePartitioner? I am facing an issue with .join() where one task runs forever. I tried repartition() with 100/200/300/1200 partitions and it did not help. I cannot use a map-side join because both datasets are huge and exceed the driver memory size.

Regards,
Deepak
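
For reference, here is a minimal Scala sketch (the input paths and key extraction are hypothetical, not from this thread) of pre-partitioning both sides with a RangePartitioner before the join. A RangePartitioner spreads keys across partitions by sampled key ranges rather than by hashCode, which can help when hash partitioning piles most rows into a few tasks:

import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}

object RangePartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("range-partitioned-join"))

    // Hypothetical pair RDDs keyed by the join column; in practice these are the two large datasets.
    val left  = sc.textFile("hdfs:///data/left").map(line => (line.split("\t")(0), line))
    val right = sc.textFile("hdfs:///data/right").map(line => (line.split("\t")(0), line))

    // Build a sample-based range partitioner from one side and reuse it for both,
    // so the join itself does not trigger another shuffle.
    val partitioner = new RangePartitioner(1200, left)
    val joined = left.partitionBy(partitioner).join(right.partitionBy(partitioner))

    joined.saveAsTextFile("hdfs:///data/joined")
    sc.stop()
  }
}

Note that, like repartition(), a RangePartitioner still cannot split a single very hot key across partitions, so if one key dominates the data the corresponding task will remain slow.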

Re: Support for skewed joins in Spark

2015-03-12 Thread Soila Pertet Kavulya
Thanks Shixiong, I'll try out your PR. Do you know what the status of the PR is? Are there any plans to incorporate this change into the DataFrames/SchemaRDDs in Spark 1.3?

Soila

On Thu, Mar 12, 2015 at 7:52 PM, Shixiong Zhu wrote:
> I sent a PR to add skewed join last year:
> https://github.com/

Re: Support for skewed joins in Spark

2015-03-12 Thread Shixiong Zhu
I sent a PR to add skewed join last year: https://github.com/apache/spark/pull/3505 However, it does not split a key across multiple partitions. Instead, if a key has too many values to fit in memory, it will store the values on disk temporarily and use the disk files to do the join.
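
For illustration only, here is a minimal plain-Scala sketch (not the PR code; all names are made up) of that spill-to-disk idea for a single oversized key: the key's values are written to a temporary file and streamed back from disk while producing the joined pairs, so they never have to sit in memory all at once:

import java.io.{File, PrintWriter}
import scala.io.Source

object DiskBackedJoinSketch {
  // Spill one key's (possibly huge) value list to a temp file, one value per line.
  def spillToDisk(values: Iterator[String]): File = {
    val file = File.createTempFile("skewed-key-", ".spill")
    file.deleteOnExit()
    val out = new PrintWriter(file)
    try values.foreach(v => out.println(v)) finally out.close()
    file
  }

  // Pair the spilled left-side values with the (small) right-side values for the
  // same key, streaming the left side from disk instead of materializing it.
  def joinFromDisk(spilled: File, rightValues: Seq[String]): Iterator[(String, String)] =
    Source.fromFile(spilled).getLines().flatMap(l => rightValues.map(r => (l, r)))

  def main(args: Array[String]): Unit = {
    val hotKeyLeft = Iterator.range(0, 1000000).map(i => s"left-$i") // stands in for a key too big for memory
    val right = Seq("right-a", "right-b")
    val joined = joinFromDisk(spillToDisk(hotKeyLeft), right)
    println(joined.take(3).toList) // List((left-0,right-a), (left-0,right-b), (left-1,right-a))
  }
}

The trade-off is the same one described above: the join no longer runs out of memory on a skewed key, but that key's work is still handled by a single task, just backed by disk.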