Re: Computing mean and standard deviation by key

2014-09-12 Thread rzykov
Thank you, David! It works:

    import org.apache.spark.util.StatCounter

    val a = ordersRDD.join(ordersRDD)
      .map { case ((partnerid, itemid), ((matchedida, pricea), (matchedidb, priceb))) =>
        ((matchedida, matchedidb), if (priceb > 0) (pricea / priceb).toDouble else 0.toDouble) }
      .groupByKey
      …
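
One plausible completion of the pipeline shown above is a per-key StatCounter, runnable in a spark-shell where sc is available; the RDD contents and names below are illustrative, not the poster's actual data:

    import org.apache.spark.util.StatCounter

    // Hypothetical (key, ratio) pairs standing in for the ((matchedida, matchedidb), price ratio) pairs.
    val ratios = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

    // One StatCounter per key: count, mean, stdev and variance for each group.
    val statsByKey = ratios.groupByKey().mapValues(vs => StatCounter(vs))

    statsByKey.collect().foreach { case (k, s) =>
      println(s"$k: n=${s.count} mean=${s.mean} stdev=${s.stdev}")
    }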

Re: Computing mean and standard deviation by key

2014-09-12 Thread David Rowe
Oh I see, I think you're trying to do something like (in SQL):

    SELECT order, mean(price) FROM orders GROUP BY order

In this case, I'm not aware of a way to use the DoubleRDDFunctions, since you have a single RDD of pairs where each pair is of type (KeyType, Iterable[Double]). It seems to me that…
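
For reference, a minimal sketch of that SQL-style GROUP BY mean using a plain groupByKey and ordinary Scala collection operations (the data and names are illustrative):

    // (order, price) pairs; groupByKey yields (order, Iterable[Double]).
    val orders = sc.parallelize(Seq(("order1", 10.0), ("order1", 20.0), ("order2", 5.0)))

    val meanByKey = orders
      .groupByKey()
      .mapValues(prices => prices.sum / prices.size)   // mean of each group

    meanByKey.collect().foreach(println)               // (order1,15.0), (order2,5.0)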

Re: Computing mean and standard deviation by key

2014-09-12 Thread Sean Owen
These functions operate on an RDD of Double, which is not what you have, so no, this is not a way to use DoubleRDDFunctions. See earlier in the thread for canonical solutions.

Re: Computing mean and standard deviation by key

2014-09-12 Thread rzykov
Tried this:

    ordersRDD.join(ordersRDD)
      .map { case ((partnerid, itemid), ((matchedida, pricea), (matchedidb, priceb))) =>
        ((matchedida, matchedidb), if (priceb > 0) (pricea / priceb).toDouble else 0.toDouble) }
      .groupByKey
      .values.stats
      .first

Error:

    :37: error: could not find imp…

Re: Computing mean and standard deviation by key

2014-09-11 Thread David Rowe
I generally call values.stats, e.g.:

    val stats = myPairRdd.values.stats
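
A short sketch of what values.stats gives you: DoubleRDDFunctions kick in on the RDD[Double] of values, so the resulting StatCounter covers all values together rather than per key (the RDD name and data are illustrative):

    import org.apache.spark.util.StatCounter

    val myPairRdd = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("b", 4.0)))

    val stats: StatCounter = myPairRdd.values.stats()   // stats over every value, ignoring keys
    println(s"count=${stats.count} mean=${stats.mean} stdev=${stats.stdev}")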

Re: Computing mean and standard deviation by key

2014-09-11 Thread rzykov
Is it possible to use DoubleRDDFunctions (https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html) for calculating mean and std dev for paired RDDs (key, value)? Right now I'm using a reduceByKey approach, but I want to make my code more concise and readable.

Re: Computing mean and standard deviation by key

2014-08-04 Thread Ron Gonzalez
Cool thanks!

Re: Computing mean and standard deviation by key

2014-08-04 Thread kriskalish
Hey Ron, it was pretty much exactly as Sean had depicted. I just needed to pass count an anonymous function to tell it which elements to count. Since I wanted to count them all, the function simply returns "true":

    val grouped = rdd.groupByKey().mapValues { mcs =>
      val values = mcs…
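
Filled out, that approach looks roughly like the following; the element type and its foo field are placeholders for whatever the real records carry:

    case class Record(foo: Double)
    val rdd = sc.parallelize(Seq(("k1", Record(1.0)), ("k1", Record(3.0)), ("k2", Record(2.0))))

    val grouped = rdd.groupByKey().mapValues { mcs =>
      val values     = mcs.map(_.foo)
      val n          = values.count(_ => true)                    // count every element, as described above
      val sum        = values.sum
      val sumSquares = values.map(x => x * x).sum
      val stdev      = math.sqrt(n * sumSquares - sum * sum) / n  // naive population stdev
      (sum / n, stdev)                                            // (mean, stdev) per key
    }
    grouped.collect().foreach(println)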

Re: Computing mean and standard deviation by key

2014-08-01 Thread Ron Gonzalez
Can you share the mapValues approach you used? Thanks, Ron

Re: Computing mean and standard deviation by key

2014-08-01 Thread kriskalish
Thanks for the help everyone. I got the mapValues approach working. I will experiment with the reduceByKey approach later. <3 -Kris

Re: Computing mean and standard deviation by key

2014-08-01 Thread Evan R. Sparks
Ignoring my warning about overflow: even more functional, just use a reduceByKey. Since your main operation is just a bunch of summing, you've got a commutative-associative reduce operation, so Spark will do everything cluster-parallel and then shuffle the (small) result set and merge appropriately.
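
A sketch of that reduceByKey formulation: accumulate (sum, sum of squares, count) per key with a commutative-associative merge, then derive mean and stdev at the end. The data and names are illustrative, and the naive formula carries the overflow caveat the message mentions:

    val data = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

    val moments = data
      .mapValues(x => (x, x * x, 1L))
      .reduceByKey { case ((s1, q1, n1), (s2, q2, n2)) => (s1 + s2, q1 + q2, n1 + n2) }

    val meanAndStdev = moments.mapValues { case (sum, sumSq, n) =>
      val mean = sum / n
      (mean, math.sqrt(sumSq / n - mean * mean))   // naive population stdev
    }
    meanAndStdev.collect().foreach(println)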

Re: Computing mean and standard deviation by key

2014-08-01 Thread Sean Owen
Here's the more functional-programming-friendly take on the computation (but yeah, this is the naive formula):

    rdd.groupByKey.mapValues { mcs =>
      val values = mcs.map(_.foo.toDouble)
      val n = values.count
      val sum = values.sum
      val sumSquares = values.map(x => x * x).sum
      math.sqrt(n * sumSquares - sum * sum) / n
    }

Re: Computing mean and standard deviation by key

2014-08-01 Thread kriskalish
So if I do something like this, Spark handles the parallelization and recombination of sum and count on the cluster automatically? I started peeking into the source and see that foreach does submit a job to the cluster, but it looked like the inner function needed to return something to work properly…

Re: Computing mean and standard deviation by key

2014-08-01 Thread Evan R. Sparks
Computing the variance is similar to this example; you just need to keep around the sum of squares as well. The formula for variance is (sumsq/n) - (sum/n)^2. But with big datasets or large values, you can quickly run into overflow issues; MLlib handles this by maintaining the average sum of…
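
For an overflow-safer route, Spark's own StatCounter keeps a running mean and M2 rather than a raw sum of squares and merges cleanly across partitions, so it can serve as the per-key accumulator. A minimal sketch with illustrative data:

    import org.apache.spark.util.StatCounter

    val data = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

    val statsByKey = data.aggregateByKey(new StatCounter())(
      (acc, v) => acc.merge(v),   // fold one value into the per-partition accumulator
      (a, b)   => a.merge(b)      // combine accumulators from different partitions
    )

    statsByKey.collect().foreach { case (k, s) =>
      println(s"$k: mean=${s.mean} variance=${s.variance}")
    }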

Re: Computing mean and standard deviation by key

2014-08-01 Thread Xu (Simon) Chen
I meant I'm not sure how to do variance in one shot :-) With the mean in hand, you can obviously broadcast the variable and do another map/reduce to calculate the variance per key.
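
A sketch of that two-pass idea: collect and broadcast the per-key means, then run a second reduceByKey over the squared deviations (names and data are illustrative). This assumes the set of keys and their means is small enough to collect to the driver and broadcast:

    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

    // Pass 1: per-key mean, collected to the driver and broadcast.
    val meanByKey = pairs
      .mapValues(v => (v, 1L))
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (s, n) => s / n }
      .collectAsMap()
    val bMeans = sc.broadcast(meanByKey)

    // Pass 2: per-key (population) variance around the broadcast mean.
    val varByKey = pairs
      .map { case (k, v) => val d = v - bMeans.value(k); (k, (d * d, 1L)) }
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (s, n) => s / n }

    varByKey.collect().foreach(println)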

Re: Computing mean and standard deviation by key

2014-08-01 Thread Xu (Simon) Chen
    val res = rdd
      .map(t => (t._1, (t._2.foo, 1)))
      .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))   // sum the values and the counts per key
      .collect

This gives you a list of (key, (tot, count)), from which you can easily calculate the mean. Not sure about variance.

Re: Computing mean and standard deviation by key

2014-08-01 Thread Sean Owen
You're certainly not iterating on the driver. The Iterable you process in your function lives on the cluster and is processed in parallel.

Re: Computing mean and standard deviation by key

2014-08-01 Thread Kristopher Kalish
The reason I want an RDD is that I'm assuming iterating over the individual elements of an RDD on the driver of the cluster is much slower than computing the mean and standard deviation with a map-reduce-based algorithm. I don't know the intimate details of Spark's implementation, but it…