I'm not sure what you mean by "mean of RDD by key". Is this what you
want?
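val meansByKey = rdd
  // Pair every value with a count of 1.
  .map { case (k, v) =>
    k -> (v, 1)
  }
  // Per key, add up the values and the counts.
  .reduceByKey { (lhs, rhs) =>
    (lhs._1 + rhs._1, lhs._2 + rhs._2)
  }
  // Elements are now (key, (sum, count)), so divide inside the value.
  // toDouble avoids integer division (assumes v is a numeric primitive).
  .mapValues { case (sum, count) =>
    sum.toDouble / count
  }
  .collectAsMap()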
With this, it seems you can imitate RDD.top()'s implementation. For each
partition, you compute the number of records and their total sum. Then, in
the final result handler, you add all the partial sums together and all the
record counts together, and divide one by the other to get the mean (the
arithmetic mean, that is).
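
In case it helps, here is a rough sketch of that per-partition idea. It is only an illustration: PartitionMean, Partial, zero, seqOp, combOp and mean are names I made up, and plain Seqs stand in for the RDD's partitions so the snippet runs without Spark.

```scala
// Hypothetical helper; plain Seqs stand in for RDD partitions.
object PartitionMean {
  // Partial result for one partition: (sum of values, number of records).
  type Partial = (Double, Long)

  val zero: Partial = (0.0, 0L)

  // Fold one record into a partition's partial result.
  def seqOp(acc: Partial, v: Double): Partial = (acc._1 + v, acc._2 + 1L)

  // Merge the partial results of two partitions.
  def combOp(a: Partial, b: Partial): Partial = (a._1 + b._1, a._2 + b._2)

  // Arithmetic mean over all partitions; None if there are no records.
  def mean(partitions: Seq[Seq[Double]]): Option[Double] = {
    val (sum, count) = partitions
      .map(_.foldLeft(zero)(seqOp)) // per-partition step
      .foldLeft(zero)(combOp)       // final result handler
    if (count == 0L) None else Some(sum / count)
  }
}
```

On a real RDD[Double] the same two functions should fit RDD.aggregate, i.e. rdd.aggregate(PartitionMean.zero)(PartitionMean.seqOp, PartitionMean.combOp), which returns the (sum, count) pair to divide on the driver.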
On Tue, Apr