If you are just looking for distinct keys, .keys.distinct() should be much better.
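For example, a minimal sketch (here `rdd` is a hypothetical pair RDD standing in for your data):

    // Sketch, assuming rdd: RDD[(K, V)]. distinct() on the keys alone
    // shuffles only the keys, not the values, so far less data moves
    // over the network than with reduceByKey on the full pairs.
    val distinctKeys = rdd.keys.distinct()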
On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme <julien.ca...@gmail.com> wrote:
> Hello,
>
> I am facing performance issues with reduceByKey. I know that this topic has
> already been covered, but I did not really find answers to my question.
>
> I am using reduceByKey to remove entries with identical keys, using
> (a, b) => a as the reduce function. It seems to be a relatively
> straightforward use of reduceByKey, but performance on moderately big RDDs
> (some tens of millions of lines) is very poor, far from what you can reach
> with single-machine computing packages like R, for example.
>
> I have read in other threads on this topic that reduceByKey always shuffles
> the whole data set. Is that true? If so, it means that custom partitioning
> could not help, right? In my case, I could relatively easily guarantee that
> two identical keys would always be on the same partition, so one option
> would be to use mapPartitions and reimplement the reduce locally, but I
> would like to know if there are simpler / more elegant alternatives.
>
> Thanks for your help,
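For reference, the per-partition deduplication mentioned in the question could look roughly like this (a sketch only, assuming identical keys never span partitions as described above; `rdd` is a hypothetical RDD of (key, value) pairs with String keys):

    import scala.collection.mutable

    // Sketch: because identical keys are assumed to be co-located on one
    // partition, deduplication can happen locally with no shuffle at all.
    val deduped = rdd.mapPartitions { iter =>
      val seen = mutable.HashSet.empty[String]        // assuming String keys
      iter.filter { case (key, _) => seen.add(key) }  // keep first pair per key
    }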