Hi,

If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you'd like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This adds a shuffle step, but it means the current upstream partitions are still executed in parallel (per whatever the current partitioning is). Note that repartition(numPartitions) is simply coalesce(numPartitions, shuffle = true), so the choice is really between coalescing with or without a shuffle.

Regards,
anish

On 8/14/15, Alexander Pivovarov <apivova...@gmail.com> wrote:
> Hi Everyone
>
> Which one should work faster (coalesce or repartition) if I need to reduce
> number of partitions from 5000 to 3 before saving RDD asTextFile
>
> Total data size is about 400MB on disk in text format
>
> Thank you
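
To make the trade-off in the reply concrete, here is a small toy model in plain Python. It is not Spark code and not how Spark is implemented. The function name stage_task_counts is hypothetical, and the model only counts how many tasks can run in parallel in each stage. The PySpark calls named in the comments (rdd.coalesce and rdd.repartition) are the real API this sketch is meant to mirror.

```python
def stage_task_counts(upstream_partitions, target_partitions, shuffle):
    """Toy model: number of parallel tasks per stage (not Spark internals)."""
    if shuffle:
        # A shuffle introduces a stage boundary: the upstream work keeps its
        # original parallelism, then data is redistributed into the target
        # number of partitions.
        return [upstream_partitions, target_partitions]
    # Without a shuffle, parent partitions are merged inside a single stage,
    # so the upstream work also runs in only `target_partitions` tasks.
    # Coalescing without a shuffle cannot increase the partition count.
    return [min(upstream_partitions, target_partitions)]


# Mirrors rdd.coalesce(3): one stage, only 3 tasks do all the work.
print(stage_task_counts(5000, 3, shuffle=False))

# Mirrors rdd.coalesce(3, shuffle=True) or rdd.repartition(3):
# 5000 upstream tasks run in parallel, then 3 tasks write the output.
print(stage_task_counts(5000, 3, shuffle=True))
```

For the case in the question (5000 partitions down to 3, about 400MB), the model only shows where the parallelism goes. Which variant is actually faster depends on how expensive the upstream computation is compared with the cost of the shuffle, so it is worth timing both.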