The reason I want an RDD is that I'm assuming iterating over the individual elements on the driver of the cluster is much slower than computing the mean and standard deviation with a map-reduce-style algorithm.
I don't know the intimate details of Spark's implementation, but it seems like each iterable element would need to be serialized and sent to the driver, which would maintain the state (count, sum, total deviation from the mean, etc.), and that is a lot of network traffic.

-Kris

On Fri, Aug 1, 2014 at 2:57 PM, Sean Owen <so...@cloudera.com> wrote:
> On Fri, Aug 1, 2014 at 7:55 PM, kriskalish <k...@kalish.net> wrote:
> > I have what seems like a relatively straightforward task to accomplish,
> > but I cannot seem to figure it out from the Spark documentation or
> > searching the mailing list.
> >
> > I have an RDD[(String, MyClass)] that I would like to group by the key,
> > and calculate the mean and standard deviation of the "foo" field of
> > MyClass. It "feels" like I should be able to use group by to get an RDD
> > for each unique key, but it gives me an iterable.
>
> Hm, why would you expect or want that? An RDD is a large distributed
> data set. It's much easier to compute a mean and stdev over an
> Iterable of numbers than an RDD.
>
> You can map your class to its double field and use anything that
> operates on doubles.
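
A minimal sketch of what Sean suggests, assuming MyClass has a numeric field named "foo" (the class and field names come from the thread; everything else is illustrative). After groupByKey, each key's values arrive as a local Iterable on an executor, so the mean and standard deviation can be computed with ordinary Scala:

    import org.apache.spark.rdd.RDD

    // Assumed shape of the class described in the thread.
    case class MyClass(foo: Double)

    def perKeyStats(data: RDD[(String, MyClass)]): RDD[(String, (Double, Double))] =
      data.groupByKey().mapValues { values =>
        val xs = values.map(_.foo)            // map each element to its double field
        val n = xs.size
        val mean = xs.sum / n
        val stdev = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / n)
        (mean, stdev)                         // population standard deviation per key
      }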
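
If the concern is materializing each whole group in memory before reducing it, one alternative (not spelled out in the thread, but a standard combine-style aggregation using Spark's built-in StatCounter summary) ships only a small running summary per partition instead of every element:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.util.StatCounter

    // Reuses the assumed MyClass(foo: Double) from the sketch above.
    def perKeyStatsAggregated(data: RDD[(String, MyClass)]): RDD[(String, (Double, Double))] =
      data
        .mapValues(_.foo)                     // keep only the double field
        .aggregateByKey(new StatCounter())(
          (acc, x) => acc.merge(x),           // fold a value into the partition-local summary
          (a, b) => a.merge(b)                // merge summaries across partitions
        )
        .mapValues(s => (s.mean, s.stdev))    // StatCounter tracks count, mean and variance

StatCounter also exposes sampleStdev if the sample (n-1) form is wanted instead of the population standard deviation.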