val res = rdd.map(t => (t._1, (t._2.foo, 1))).reduceByKey((x, y) =>
  (x._1 + y._1, x._2 + y._2)).collect()

This gives you a list of (key, (tot, count)) pairs, from which you can
easily calculate the mean. Not sure about variance.
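
For the variance, a common single-pass trick is to accumulate the sum of
squares alongside the sum and count, then use var(x) = E[x^2] - E[x]^2.
A minimal sketch along those lines (assuming foo is a Double):

val moments = rdd
  .map(t => (t._1, (t._2.foo, t._2.foo * t._2.foo, 1L)))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3))

// (key, (mean, stddev)); this is the population variance.
// Divide by n - 1 instead of n if you want the sample variance.
val stats = moments.mapValues { case (sum, sumSq, n) =>
  val mean = sum / n
  (mean, math.sqrt(sumSq / n - mean * mean))
}

Note that E[x^2] - E[x]^2 can lose precision when the variance is tiny
relative to the mean; Spark's StatCounter avoids this with a merge-based
update (see the note after the quoted message below).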


On Fri, Aug 1, 2014 at 2:55 PM, kriskalish <k...@kalish.net> wrote:

> I have what seems like a relatively straightforward task to accomplish,
> but I cannot seem to figure it out from the Spark documentation or by
> searching the mailing list.
>
> I have an RDD[(String, MyClass)] that I would like to group by the key, and
> calculate the mean and standard deviation of the "foo" field of MyClass. It
> "feels" like I should be able to use group by to get an RDD for each unique
> key, but it gives me an iterable.
>
> As in:
>
> val grouped = rdd.groupByKey()
>
> grouped.foreach{g =>
>    val mean = g._2.map(x => x.foo).mean()
>    val dev = g._2.map(x => x.foo).stddev()
>    // do fancy things with the mean and deviation
> }
>
> However, there seems to be no way to convert the iterable into an RDD. Is
> there some other technique for doing this? I'm to the point where I'm
> considering copying and pasting the StatCounter class and changing the
> type from Double to MyClass (or making it generic).
>
> Am I going down the wrong path?
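
An aside on the groupByKey approach in the quoted question: the grouped
values arrive as a plain in-memory Iterable, so there is no need for an
RDD per key. You can feed each group straight into Spark's built-in
StatCounter, which already computes the mean and standard deviation.
A sketch, again assuming foo is a Double:

import org.apache.spark.util.StatCounter

// StatCounter accepts any TraversableOnce[Double]
val stats = rdd.groupByKey().mapValues(vs => StatCounter(vs.map(_.foo)))
stats.collect().foreach { case (k, s) =>
  println(s"$k: mean=${s.mean}, stdev=${s.stdev}")
}

Bear in mind that groupByKey materializes all of a key's values on one
executor, so the reduceByKey formulation above scales better when groups
are large.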
