Hi,

If you are decreasing the number of partitions in this RDD, consider
using coalesce, which can avoid performing a shuffle.

However, if you're doing a drastic coalesce, e.g. to numPartitions =
1, your computation may end up running on fewer nodes than you would
like (e.g. one node in the case of numPartitions = 1). To avoid this,
you can pass shuffle = true. This adds a shuffle step, but it means the
upstream partitions are still computed in parallel (per whatever the
current partitioning is). Note that repartition(n) is the same as
coalesce(n, shuffle = true), so the real choice is whether to shuffle.
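
For illustration, here is a rough PySpark sketch of the two options. The
helper names (shrink_partitions, save_small) and the paths are made up
for this example and are not part of Spark; the 5000 and 3 figures come
from your question. In Python the flag is written shuffle=True.

```python
def shrink_partitions(rdd, num_partitions, keep_upstream_parallelism=False):
    """Reduce an RDD to num_partitions partitions (hypothetical helper).

    keep_upstream_parallelism=False: plain coalesce, no shuffle, but the
    upstream work is squeezed into only num_partitions tasks.
    keep_upstream_parallelism=True: coalesce with shuffle=True, which adds
    a shuffle but lets the existing upstream partitions run in parallel.
    """
    return rdd.coalesce(num_partitions, shuffle=keep_upstream_parallelism)


def save_small(sc, input_path, output_path):
    """Hypothetical driver: read with many partitions, write with 3."""
    # sc is an existing SparkContext; both paths are placeholders.
    rdd = sc.textFile(input_path, minPartitions=5000)
    small = shrink_partitions(rdd, 3, keep_upstream_parallelism=True)
    small.saveAsTextFile(output_path)
```

Which variant is faster for about 400MB depends on how expensive the
upstream stages are, so it is worth timing both on your job.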

Regards,

anish

On 8/14/15, Alexander Pivovarov <apivova...@gmail.com> wrote:
> Hi Everyone
>
> Which one should work faster (coalesce or repartition) if I need to reduce
> number of partitions from 5000 to 3 before saving RDD asTextFile
>
> Total data size is about 400MB on disk in text format
>
> Thank you
>
