How to redistribute dataset without full shuffle

Artur R Fri, 17 Mar 2017 14:52:35 -0700

Hi!

I use Spark heavily for various workloads and keep running into the same
situation: a dataset is skewed (with no partitioner assigned) and I just
want to "redistribute" its data more evenly.


For example, say there is an RDD of X partitions with Y rows in each, except
for one large partition with Y * 10 rows. I don't want to change the number
of partitions, only redistribute the rows. Obviously, such an operation should
not need to send more than ~Y * 9 rows across the network.
But the only option available is repartition, which requires a full shuffle
of ALL ~(X + 9) * Y rows.
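
To make the numbers concrete, here is a rough back-of-the-envelope sketch in
plain Python (it does not use any Spark API). The function name
min_rows_to_move and the values X = 10, Y = 1000 are made up for
illustration. It only counts rows; it says nothing about how such an
operation could actually be implemented.

```python
def min_rows_to_move(sizes):
    """Lower bound on rows that must leave their partition to balance sizes.

    Hypothetical helper for illustration only, not part of Spark.
    """
    n = len(sizes)
    base, extra = divmod(sum(sizes), n)
    moved = 0
    # Give the slightly larger targets (base + 1) to the largest partitions,
    # so they have to shed as few rows as possible.
    for i, size in enumerate(sorted(sizes, reverse=True)):
        target = base + 1 if i < extra else base
        moved += max(0, size - target)
    return moved


# Assumed example: X = 10 partitions, Y = 1000 rows, one partition with Y * 10.
X, Y = 10, 1000
sizes = [Y] * (X - 1) + [Y * 10]

ideal = min_rows_to_move(sizes)   # rows an ideal "redistribute" would move
full_shuffle = sum(sizes)         # rows that go through a full shuffle

print(ideal)          # 8100
print(full_shuffle)   # 19000
```

With these assumed numbers, balancing only needs to move 8100 rows (under
Y * 9 = 9000), while a full shuffle handles all 19000.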

The question: why is there no operation like "redistribute"?
