I was going to suggest the same thing :).

On Jun 18, 2014, at 4:56 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
> This looks like a job for SparkSQL!
>
>     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>     import sqlContext._
>     case class MyRecord(country: String, name: String, age: Int, hits: Long)
>     val data = sc.parallelize(Array(
>       MyRecord("USA", "Franklin", 24, 234),
>       MyRecord("USA", "Bob", 55, 108),
>       MyRecord("France", "Remi", 33, 72)))
>     data.registerAsTable("MyRecords")
>     val results = sql("""SELECT t.country, AVG(t.age), SUM(t.hits)
>                          FROM MyRecords t GROUP BY t.country""").collect
>
> Now "results" contains:
>
>     Array[org.apache.spark.sql.Row] = Array([France,33.0,72], [USA,39.5,342])
>
> On Wed, Jun 18, 2014 at 4:42 PM, Doris Xin <doris.s....@gmail.com> wrote:
>
> > Hi Nick,
> >
> > Instead of using reduceByKey(), you might want to look into
> > aggregateByKey(), which allows you to return a different value type U
> > instead of the input value type V for each input tuple (K, V). You can
> > define U to be a datatype that holds both the average and the total, and
> > have seqOp update both fields of U in a single pass.
> >
> > Hope this makes sense,
> > Doris
> >
> > On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas
> > <nicholas.cham...@gmail.com> wrote:
> >
> > > The following is a simplified example of what I am trying to accomplish.
> > >
> > > Say I have an RDD of objects like this:
> > >
> > >     { "country": "USA", "name": "Franklin", "age": 24, "hits": 224 }
> > >     { "country": "USA", "name": "Bob", "age": 55, "hits": 108 }
> > >     { "country": "France", "name": "Remi", "age": 33, "hits": 72 }
> > >
> > > I want to find the average age and the total number of hits per country.
> > > Ideally, I would like to scan the data once and perform both
> > > aggregations simultaneously.
> > >
> > > What is a good approach to doing this?
> > >
> > > I'm thinking that we'd want to keyBy(country) and then somehow
> > > reduceByKey(). The problem is, I don't know how to write a function
> > > that can be passed to reduceByKey() and that tracks a running average
> > > and a running total simultaneously.
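Doris's aggregateByKey() suggestion can be sketched roughly as follows. The Acc class and the record layout are hypothetical illustrations, not from the thread; the actual Spark call is shown only in a comment, while the same seqOp/combOp logic runs here on plain Scala collections so the sketch is self-contained.

```scala
// Accumulator holding everything needed for both aggregates in one pass:
// the running age sum, the record count, and the running hit sum.
// The average is derived from (ageSum, count) at the very end.
case class Acc(ageSum: Long, count: Long, hitSum: Long)

// seqOp: fold one (age, hits) record into a partial accumulator
def seqOp(acc: Acc, rec: (Int, Long)): Acc =
  Acc(acc.ageSum + rec._1, acc.count + 1, acc.hitSum + rec._2)

// combOp: merge two partial accumulators (e.g. from different partitions)
def combOp(a: Acc, b: Acc): Acc =
  Acc(a.ageSum + b.ageSum, a.count + b.count, a.hitSum + b.hitSum)

// With a live SparkContext this would be:
//   keyed.aggregateByKey(Acc(0, 0, 0))(seqOp, combOp)
// Here the same logic runs on plain collections:
val records = Seq(
  ("USA", (24, 224L)), ("USA", (55, 108L)), ("France", (33, 72L)))
val perCountry = records.groupBy(_._1).map { case (country, rows) =>
  val acc = rows.map(_._2).foldLeft(Acc(0, 0, 0))(seqOp)
  country -> (acc.ageSum.toDouble / acc.count, acc.hitSum)
}
// perCountry("USA") == (39.5, 332); perCountry("France") == (33.0, 72)
```

Note that the accumulator stores the age *sum* and *count* rather than the average itself: averages cannot be merged directly, but sums and counts can, which is what makes combOp associative.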
> > > Nick
> > >
> > > View this message in context: Patterns for making multiple
> > > aggregations in one pass
> > > Sent from the Apache Spark User List mailing list archive at Nabble.com.
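For completeness, Nick's reduceByKey() idea can also work in a single pass, provided each record is first mapped to a summable triple (ageSum, count, hitSum) so the reduce function only adds component-wise. The record values and names below are illustrative; the RDD call appears in a comment, and the equivalent logic runs on plain collections.

```scala
// Map each record to (country, (ageSum, count, hitSum)) so that
// reduceByKey only has to add tuples component-wise; the average
// is computed once at the end, after the single pass over the data.
val records = Seq(
  ("USA", 24, 224L), ("USA", 55, 108L), ("France", 33, 72L))

// With an RDD this would be:
//   rdd.map { r => (r.country, (r.age.toLong, 1L, r.hits)) }
//      .reduceByKey { case ((a1, c1, h1), (a2, c2, h2)) =>
//        (a1 + a2, c1 + c2, h1 + h2) }
// Equivalent single-pass logic on plain collections:
val totals = records
  .map { case (country, age, hits) => country -> (age.toLong, 1L, hits) }
  .groupBy(_._1)
  .map { case (country, rows) =>
    country -> rows.map(_._2).reduce { (x, y) =>
      (x._1 + y._1, x._2 + y._2, x._3 + y._3) }
  }
val results = totals.map { case (c, (ageSum, n, hitSum)) =>
  c -> (ageSum.toDouble / n, hitSum) }
// results("USA") == (39.5, 332); results("France") == (33.0, 72)
```

The key design choice is the same as in the aggregateByKey sketch: never carry a running average through the reduce, because averages do not combine associatively; carry sums and a count instead.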