OK, mapPartitions seems to be the way to go. Thanks for the help!
On Sat, Sep 13, 2014 at 4:41 PM, Sean Owen wrote:
This is more concise:
x.groupBy(_.fieldtobekey).values.map(_.head)
... but I doubt it's faster: groupBy shuffles every object across the
network, while reduceByKey can combine map-side before the shuffle.
If all objects with the same fieldtobekey are within the same
partition, then yes I imagine your biggest speedup comes from
exploiting that. How about ...
x.mapPartitions(_.map(obj => (obj.fieldtobekey, obj)).toMap.values.iterator)
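Or, a minimal sketch of the same idea that streams through the partition
instead of building a Map of whole objects (assuming the key is a String
and that co-partitioning by fieldtobekey really holds):

// Keep an object only the first time its key is seen in this partition.
// No shuffle; only a set of keys is held in memory per partition.
x.mapPartitions { iter =>
  val seen = scala.collection.mutable.HashSet.empty[String]
  iter.filter(obj => seen.add(obj.fieldtobekey))
}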
You need something like:
val x: RDD[MyAwesomeObject]
x.map(obj => obj.fieldtobekey -> obj).reduceByKey { case (l, _) => l }
Does that make sense?
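Concretely, a self-contained sketch of that (the case class fields and
sample values are hypothetical stand-ins):

import org.apache.spark.{SparkConf, SparkContext}

case class MyAwesomeObject(fieldtobekey: String, payload: Int)

val sc = new SparkContext(
  new SparkConf().setAppName("dedupe").setMaster("local[*]"))
val x = sc.parallelize(Seq(
  MyAwesomeObject("a", 1),
  MyAwesomeObject("a", 2),
  MyAwesomeObject("b", 3)))

// Pair each object with its key, keep the first of any two objects that
// share a key, then drop the keys to recover the objects.
val deduped = x.map(obj => obj.fieldtobekey -> obj)
  .reduceByKey { case (l, _) => l }
  .values
deduped.collect().foreach(println)  // one object for "a", one for "b"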
On Sat, Sep 13, 2014 at 7:28 AM, Julien Carme wrote:
I need to remove objects with duplicate keys, but I need the whole object.
Objects which have the same key are not necessarily equal, though (but I
can drop any of the ones that have an identical key).
On Sat, Sep 13, 2014 at 12:50 PM, Sean Owen wrote:
If you are just looking for distinct keys, .keys.distinct() should be
much better.
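For example, reusing the pair RDD from the reduceByKey suggestion above:

val pairs = x.map(obj => obj.fieldtobekey -> obj)
pairs.keys.distinct()  // shuffles only the keys, not the whole objects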
On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme wrote:
Hello,
I am facing performance issues with reduceByKey. I know that this topic
has already been covered but I did not really find answers to my question.
I am using reduceByKey to remove entries with identical keys, using
(a, b) => a as the reduce function. It seems to be a relatively
straightforward operation, yet it is very slow.
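In code, that is presumably something like this sketch (field name as used
elsewhere in the thread):

// Pair by key, arbitrarily keep one of any two objects sharing a key.
x.map(obj => (obj.fieldtobekey, obj))
  .reduceByKey((a, b) => a)
  .values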