Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Corey Nolet
Sorry Du, Repartition means coalesce(shuffle = true) as per [1]. They are the same operation. Coalescing with shuffle = false means you are specifying the max amount of partitions after the coalesce (if there are less partitions you will end up with the lesser amount. [1] https://github.com/apac

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Du Li
repartition() means coalesce(shuffle=false) On Thursday, June 18, 2015 4:07 PM, Corey Nolet wrote: Doesn't repartition call coalesce(shuffle=true)?On Jun 18, 2015 6:53 PM, "Du Li" wrote: I got the same problem with rdd,repartition() in my streaming app, which generated a few huge

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Corey Nolet
Doesn't repartition call coalesce(shuffle=true)? On Jun 18, 2015 6:53 PM, "Du Li" wrote: > I got the same problem with rdd,repartition() in my streaming app, which > generated a few huge partitions and many tiny partitions. The resulting > high data skew makes the processing time of a batch unpre

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Du Li
I got the same problem with rdd,repartition() in my streaming app, which generated a few huge partitions and many tiny partitions. The resulting high data skew makes the processing time of a batch unpredictable and often exceeding the batch interval. I eventually solved the problem by using rdd

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Al M
Thanks for the suggestion. Repartition didn't help us unfortunately. It still puts everything into the same partition. We did manage to improve the situation by making a new partitioner that extends HashPartitioner. It treats certain "exception" keys differently. These keys that are known to a

Re: Shuffle produces one huge partition

2015-06-17 Thread Akhil Das
Can you try repartitioning the rdd after creating the K,V. And also, while calling the rdd1.join(rdd2, Pass the # partition argument too) Thanks Best Regards On Wed, Jun 17, 2015 at 12:15 PM, Al M wrote: > I have 2 RDDs I want to Join. We will call them RDD A and RDD B. RDD A > has > 1 billio