val res = rdd.map(t => (t._1, (t._2.foo, 1))).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).collect()
This gives you a list of (key, (total, count)) pairs, from which you can easily calculate the mean. Not sure about variance offhand; there is a one-pass sketch for it below the quoted message.

On Fri, Aug 1, 2014 at 2:55 PM, kriskalish <k...@kalish.net> wrote:
> I have what seems like a relatively straightforward task to accomplish, but I
> cannot seem to figure it out from the Spark documentation or searching the
> mailing list.
>
> I have an RDD[(String, MyClass)] that I would like to group by the key, and
> calculate the mean and standard deviation of the "foo" field of MyClass. It
> "feels" like I should be able to use group by to get an RDD for each unique
> key, but it gives me an iterable.
>
> As in:
>
> val grouped = rdd.groupByKey()
>
> grouped.foreach { g =>
>   val mean = g.map(x => x.foo).mean()
>   val dev = g.map(x => x.foo).stddev()
>   // do fancy things with the mean and deviation
> }
>
> However, there seems to be no way to convert the iterable into an RDD. Is
> there some other technique for doing this? I'm to the point where I'm
> considering copying and pasting the StatCollector class and changing the
> type from Double to MyClass (or making it generic).
>
> Am I going down the wrong path?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Computing-mean-and-standard-deviation-by-key-tp11192.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
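For completeness, here is a minimal one-pass sketch covering both statistics. It assumes MyClass.foo is a Double (not confirmed in the original post): accumulate (count, sum, sum of squares) per key, then derive the mean and standard deviation from those three numbers.

// Untested sketch; assumes rdd: RDD[(String, MyClass)] and foo: Double.
val stats = rdd
  .map { case (k, v) => (k, (1L, v.foo, v.foo * v.foo)) }
  .reduceByKey { case ((n1, s1, q1), (n2, s2, q2)) =>
    // merge partial (count, sum, sumOfSquares) triples
    (n1 + n2, s1 + s2, q1 + q2)
  }
  .mapValues { case (n, sum, sumSq) =>
    val mean = sum / n
    val variance = sumSq / n - mean * mean  // population variance
    (mean, math.sqrt(variance))
  }

The usual caveat applies: the sum-of-squares formula can lose precision when the values are large relative to their spread, so a Welford-style merge is more stable if that matters for your data.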