If you are just looking for distinct keys, .keys.distinct() should be much better.
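For example, a minimal sketch (here `rdd` is a hypothetical pair RDD standing in for your data):

    // Sketch, assuming rdd: RDD[(K, V)]. distinct() on the keys alone
    // shuffles only the keys, not the values, so far less data moves
    // over the network than with reduceByKey on the full pairs.
    val distinctKeys = rdd.keys.distinct()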
On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme <julien.ca...@gmail.com> wrote:
> Hello,
>
> I am facing performance issues with reduceByKey. I know that this topic has
> already been covered, but I did not really find answers to my question.
>
> I am using reduceByKey to remove entries with identical keys, using
> (a, b) => a as the reduce function. It seems to be a relatively
> straightforward use of reduceByKey, but performance on moderately big RDDs
> (some tens of millions of lines) is very poor, far from what you can reach
> with single-machine computing packages like R, for example.
>
> I have read in other threads on this topic that reduceByKey always shuffles
> the whole data set. Is that true? If so, it means that custom partitioning
> could not help, right? In my case, I could relatively easily guarantee that
> two identical keys would always be on the same partition, so one option
> would be to use mapPartitions and reimplement the reduce locally, but I
> would like to know if there are simpler / more elegant alternatives.
>
> Thanks for your help,
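For reference, the per-partition deduplication mentioned in the question could look roughly like this (a sketch only, assuming identical keys never span partitions as described above; `rdd` is a hypothetical RDD of (key, value) pairs with String keys):

    import scala.collection.mutable

    // Sketch: because identical keys are assumed to be co-located on one
    // partition, deduplication can happen locally with no shuffle at all.
    val deduped = rdd.mapPartitions { iter =>
      val seen = mutable.HashSet.empty[String]        // assuming String keys
      iter.filter { case (key, _) => seen.add(key) }  // keep first pair per key
    }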