The reason I want an RDD is that I'm assuming iterating over the individual elements on the driver of the cluster is much slower than computing the mean and standard deviation with a map-reduce-style algorithm.
I don't know the intimate details of Spark's implementation, but it seems like each iterable element would need to be serialized and sent to the driver, which would maintain the state (count, sum, total deviation from the mean, etc.), and that is a lot of network traffic.

-Kris

On Fri, Aug 1, 2014 at 2:57 PM, Sean Owen <so...@cloudera.com> wrote:
> On Fri, Aug 1, 2014 at 7:55 PM, kriskalish <k...@kalish.net> wrote:
> > I have what seems like a relatively straightforward task to accomplish,
> > but I cannot seem to figure it out from the Spark documentation or
> > searching the mailing list.
> >
> > I have an RDD[(String, MyClass)] that I would like to group by the key,
> > and calculate the mean and standard deviation of the "foo" field of
> > MyClass. It "feels" like I should be able to use group by to get an RDD
> > for each unique key, but it gives me an iterable.
>
> Hm, why would you expect or want that? An RDD is a large distributed
> data set. It's much easier to compute a mean and stdev over an
> Iterable of numbers than an RDD.
>
> You can map your class to its double field and use anything that
> operates on doubles.
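
A minimal sketch of what Sean suggests, assuming MyClass has a numeric field named "foo" (the class and field names come from the thread; everything else is illustrative). After groupByKey, each key's values arrive as a local Iterable on an executor, so the mean and standard deviation can be computed with ordinary Scala:

    import org.apache.spark.rdd.RDD

    // Assumed shape of the class described in the thread.
    case class MyClass(foo: Double)

    def perKeyStats(data: RDD[(String, MyClass)]): RDD[(String, (Double, Double))] =
      data.groupByKey().mapValues { values =>
        val xs = values.map(_.foo)            // map each element to its double field
        val n = xs.size
        val mean = xs.sum / n
        val stdev = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / n)
        (mean, stdev)                         // population standard deviation per key
      }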
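
If the concern is materializing each whole group in memory before reducing it, one alternative (not spelled out in the thread, but a standard combine-style aggregation using Spark's built-in StatCounter summary) ships only a small running summary per partition instead of every element:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.util.StatCounter

    // Reuses the assumed MyClass(foo: Double) from the sketch above.
    def perKeyStatsAggregated(data: RDD[(String, MyClass)]): RDD[(String, (Double, Double))] =
      data
        .mapValues(_.foo)                     // keep only the double field
        .aggregateByKey(new StatCounter())(
          (acc, x) => acc.merge(x),           // fold a value into the partition-local summary
          (a, b) => a.merge(b)                // merge summaries across partitions
        )
        .mapValues(s => (s.mean, s.stdev))    // StatCounter tracks count, mean and variance

StatCounter also exposes sampleStdev if the sample (n-1) form is wanted instead of the population standard deviation.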