OK, mapPartitions seems to be the way to go. Thanks for the help!

On 13 Sep 2014 16:41, "Sean Owen" <so...@cloudera.com> wrote:
> This is more concise:
>
> x.groupBy(_.fieldtobekey).values.map(_.head)
>
> ... but I doubt it's faster.
>
> If all objects with the same fieldtobekey are within the same
> partition, then yes, I imagine your biggest speedup comes from
> exploiting that. How about ...
>
> x.mapPartitions(_.map(obj => (obj.fieldtobekey, obj)).toMap.values.iterator)
>
> This does require that all keys, plus a representative object each,
> fit in memory.
> I bet you can make it faster than this example too.
>
>
> On Sat, Sep 13, 2014 at 1:15 PM, Gary Malouf <malouf.g...@gmail.com> wrote:
> > You need something like:
> >
> > val x: RDD[MyAwesomeObject]
> >
> > x.map(obj => obj.fieldtobekey -> obj).reduceByKey { case (l, _) => l }
> >
> > Does that make sense?
> >
> >
> > On Sat, Sep 13, 2014 at 7:28 AM, Julien Carme <julien.ca...@gmail.com> wrote:
> >> I need to remove objects with duplicate keys, but I need the whole object.
> >> Objects which have the same key are not necessarily equal, though (but I
> >> can drop any of the ones that have identical keys).
> >>
> >> 2014-09-13 12:50 GMT+02:00 Sean Owen <so...@cloudera.com>:
> >>> If you are just looking for distinct keys, .keys.distinct() should be
> >>> much better.
> >>>
> >>> On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme <julien.ca...@gmail.com> wrote:
> >>> > Hello,
> >>> >
> >>> > I am facing performance issues with reduceByKey. I know that this
> >>> > topic has already been covered, but I did not really find answers to
> >>> > my question.
> >>> >
> >>> > I am using reduceByKey to remove entries with identical keys, using
> >>> > (a, b) => a as the reduce function. It seems to be a relatively
> >>> > straightforward use of reduceByKey, but performance on moderately big
> >>> > RDDs (some tens of millions of lines) is very low, far from what you
> >>> > can reach with single-machine computing packages like R, for example.
> >>> > I have read in other threads on the topic that reduceByKey always
> >>> > entirely shuffles the whole data. Is that true? That would mean a
> >>> > custom partitioning could not help, right? In my case, I could
> >>> > relatively easily guarantee that two identical keys would always be
> >>> > on the same partition, so an option could be to use mapPartitions
> >>> > and reimplement the reduce locally, but I would like to know if there
> >>> > are simpler / more elegant alternatives.
> >>> >
> >>> > Thanks for your help,
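[Editor's note] The two deduplication strategies discussed in this thread can be sketched in plain Scala, without a Spark dependency, on an in-memory collection. `Record` and its fields are hypothetical stand-ins for `MyAwesomeObject` and `fieldtobekey`; `dedupByReduce` mirrors what `reduceByKey { case (l, _) => l }` computes per key, and `dedupPartition` is the per-partition `toMap.values` trick, which avoids a shuffle when all duplicates already share a partition:

```scala
// Hypothetical record type standing in for MyAwesomeObject in the thread.
case class Record(key: String, payload: Int)

object DedupSketch {
  // Mirrors reduceByKey { case (l, _) => l }: keep the first record
  // encountered for each key.
  def dedupByReduce(records: Seq[Record]): Map[String, Record] =
    records.groupBy(_.key).map { case (k, vs) => k -> vs.reduce((l, _) => l) }

  // The mapPartitions variant: deduplicate one partition's iterator
  // in memory. All distinct keys of the partition, plus one record
  // each, must fit in memory.
  def dedupPartition(it: Iterator[Record]): Iterator[Record] =
    it.map(r => r.key -> r).toMap.values.iterator

  def main(args: Array[String]): Unit = {
    val data = Seq(Record("a", 1), Record("b", 2), Record("a", 3))
    println(dedupByReduce(data).size)           // 2 distinct keys
    println(dedupPartition(data.iterator).size) // 2 distinct keys
  }
}
```

One detail worth noting: `toMap` keeps the *last* record seen per key, while `reduce((l, _) => l)` keeps the first. The thread states that any one of the duplicates may be dropped, so either is acceptable here, but the two approaches are not byte-for-byte equivalent.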