Subject: aggregateByKey vs combineByKey
From: mmistr...@gmail.com
To: user@spark.apache.org
Hi all,
I have the following dataset:
kv = [(2,Hi), (1,i), (2,am), (1,a), (4,test), (6,string)]
It's a simple list of tuples containing (word_length, word).
What I wanted to do was to group the result by key in order to have a result of the form
[(word_length_1, [word1, word2, word3]), (word_length_2, [...]), ...]
Looking at PairRDDFunctions.scala:
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(
    seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  ...
  combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
    cleanedSeqOp, combOp, partitioner)
}
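
For reference, a minimal sketch of that grouping done with aggregateByKey (Scala RDD API; the SparkContext setup, the local master and the object name are assumptions made only so the snippet is self-contained):

import org.apache.spark.{SparkConf, SparkContext}

object GroupWordsByLength {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("group-by-length").setMaster("local[*]"))

    // The example dataset from the question, as an RDD[(Int, String)].
    val kv = sc.parallelize(Seq((2, "Hi"), (1, "i"), (2, "am"), (1, "a"), (4, "test"), (6, "string")))

    // aggregateByKey: start each key from an empty List[String] (the zero value),
    // prepend values within a partition (seqOp), and concatenate the partial lists
    // coming from different partitions (combOp).
    val grouped = kv.aggregateByKey(List.empty[String])(
      (acc, word) => word :: acc,
      (left, right) => left ::: right
    )

    grouped.collect().foreach(println)  // e.g. (2,List(am, Hi)), (4,List(test)), ...
    sc.stop()
  }
}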
Thanks Liquan, that was really helpful.
On Mon, Sep 29, 2014 at 5:54 PM, Liquan Pei wrote:
> Hi Dave,
>
> You can replace groupByKey with reduceByKey to improve performance in some
> cases. reduceByKey performs a map-side combine, which can reduce network IO
> and shuffle size, whereas groupByKey will not perform a map-side combine.
Hi Dave,
You can replace groupByKey with reduceByKey to improve performance in some
cases. reduceByKey performs a map-side combine, which can reduce network IO
and shuffle size, whereas groupByKey will not perform a map-side combine.
combineByKey is more general than aggregateByKey. Actually, the
implementation of aggregateByKey, reduceByKey and groupByKey is achieved by
combineByKey.
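
A small sketch of the difference Liquan describes, assuming the same RDD[(Int, String)] of (word_length, word) pairs as above (the helper name is only illustrative):

import org.apache.spark.rdd.RDD

// reduceByKey merges values per key inside each map partition before the shuffle
// (map-side combine), so less data crosses the network; groupByKey ships every raw
// (key, value) pair to the reducers and only then groups them.
def concatWordsBothWays(kv: RDD[(Int, String)]): (RDD[(Int, String)], RDD[(Int, String)]) = {
  val viaReduce = kv.reduceByKey((a, b) => a + "," + b)        // map-side combine
  val viaGroup  = kv.groupByKey().mapValues(_.mkString(","))   // no map-side combine
  (viaReduce, viaGroup)
}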
Hi All,
After some hair pulling, I've reached the realisation that an operation I
am currently doing via:
myRDD.groupByKey.mapValues(func)
should be done more efficiently using aggregateByKey or combineByKey. Both
of these methods would do, and they seem very similar to me in terms of
their functionality.
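
A hypothetical sketch of that rewrite, assuming for illustration that func simply sums the grouped values (myRDD, its key type and the summing func are assumptions, not taken from the original mail):

import org.apache.spark.rdd.RDD

// Both forms push the aggregation into the shuffle instead of materialising each
// group first, which is what makes them cheaper than groupByKey.mapValues(func).
def sumPerKey(myRDD: RDD[(String, Int)]): (RDD[(String, Int)], RDD[(String, Int)]) = {
  // aggregateByKey: zero value, per-partition fold (seqOp), cross-partition merge (combOp).
  val viaAggregate = myRDD.aggregateByKey(0)(_ + _, _ + _)

  // combineByKey: the most general form; it also controls how the first value seen
  // for a key becomes an accumulator (createCombiner).
  val viaCombine = myRDD.combineByKey(
    (v: Int) => v,                   // createCombiner
    (acc: Int, v: Int) => acc + v,   // mergeValue, within a partition
    (a: Int, b: Int) => a + b        // mergeCombiners, across partitions
  )
  (viaAggregate, viaCombine)
}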