Sorry Du,
Repartition means coalesce(shuffle = true) as per [1]. They are the same
operation. Coalescing with shuffle = false means you are specifying the max
amount of partitions after the coalesce (if there are less partitions you
will end up with the lesser amount.
[1]
https://github.com/apac
repartition() means coalesce(shuffle=false)
On Thursday, June 18, 2015 4:07 PM, Corey Nolet wrote:
Doesn't repartition call coalesce(shuffle=true)?On Jun 18, 2015 6:53 PM, "Du
Li" wrote:
I got the same problem with rdd,repartition() in my streaming app, which
generated a few huge
Doesn't repartition call coalesce(shuffle=true)?
On Jun 18, 2015 6:53 PM, "Du Li" wrote:
> I got the same problem with rdd,repartition() in my streaming app, which
> generated a few huge partitions and many tiny partitions. The resulting
> high data skew makes the processing time of a batch unpre
I got the same problem with rdd,repartition() in my streaming app, which
generated a few huge partitions and many tiny partitions. The resulting high
data skew makes the processing time of a batch unpredictable and often
exceeding the batch interval. I eventually solved the problem by using
rdd
Thanks for the suggestion. Repartition didn't help us unfortunately. It
still puts everything into the same partition.
We did manage to improve the situation by making a new partitioner that
extends HashPartitioner. It treats certain "exception" keys differently.
These keys that are known to a
Can you try repartitioning the rdd after creating the K,V. And also, while
calling the rdd1.join(rdd2, Pass the # partition argument too)
Thanks
Best Regards
On Wed, Jun 17, 2015 at 12:15 PM, Al M wrote:
> I have 2 RDDs I want to Join. We will call them RDD A and RDD B. RDD A
> has
> 1 billio