I was going to suggest the same thing :).

On Jun 18, 2014, at 4:56 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
> This looks like a job for SparkSQL!
>
>     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>     import sqlContext._
>     case class MyRecord(country: String, name: String, age: Int, hits: Long)
>     val data = sc.parallelize(Array(
>       MyRecord("USA", "Franklin", 24, 234),
>       MyRecord("USA", "Bob", 55, 108),
>       MyRecord("France", "Remi", 33, 72)))
>     data.registerAsTable("MyRecords")
>     val results = sql("""SELECT t.country, AVG(t.age), SUM(t.hits)
>                          FROM MyRecords t GROUP BY t.country""").collect
>
> Now "results" contains:
>
>     Array[org.apache.spark.sql.Row] = Array([France,33.0,72], [USA,39.5,342])
>
> On Wed, Jun 18, 2014 at 4:42 PM, Doris Xin <doris.s....@gmail.com> wrote:
>
> > Hi Nick,
> >
> > Instead of using reduceByKey(), you might want to look into
> > aggregateByKey(), which allows you to return a different value type U
> > instead of the input value type V for each input tuple (K, V). You can
> > define U to be a datatype that holds both the average and the total, and
> > have seqOp update both fields of U in a single pass.
> >
> > Hope this makes sense,
> > Doris
> >
> > On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas
> > <nicholas.cham...@gmail.com> wrote:
> >
> > > The following is a simplified example of what I am trying to accomplish.
> > >
> > > Say I have an RDD of objects like this:
> > >
> > >     { "country": "USA", "name": "Franklin", "age": 24, "hits": 224 }
> > >     { "country": "USA", "name": "Bob", "age": 55, "hits": 108 }
> > >     { "country": "France", "name": "Remi", "age": 33, "hits": 72 }
> > >
> > > I want to find the average age and the total number of hits per country.
> > > Ideally, I would like to scan the data once and perform both
> > > aggregations simultaneously.
> > >
> > > What is a good approach to doing this?
> > >
> > > I'm thinking that we'd want to keyBy(country) and then somehow
> > > reduceByKey(). The problem is, I don't know how to write a function
> > > that can be passed to reduceByKey() and that tracks a running average
> > > and a running total simultaneously.
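Doris's aggregateByKey() suggestion can be sketched roughly as follows. The Acc class and the record layout are hypothetical illustrations, not from the thread; the actual Spark call is shown only in a comment, while the same seqOp/combOp logic runs here on plain Scala collections so the sketch is self-contained.

```scala
// Accumulator holding everything needed for both aggregates in one pass:
// the running age sum, the record count, and the running hit sum.
// The average is derived from (ageSum, count) at the very end.
case class Acc(ageSum: Long, count: Long, hitSum: Long)

// seqOp: fold one (age, hits) record into a partial accumulator
def seqOp(acc: Acc, rec: (Int, Long)): Acc =
  Acc(acc.ageSum + rec._1, acc.count + 1, acc.hitSum + rec._2)

// combOp: merge two partial accumulators (e.g. from different partitions)
def combOp(a: Acc, b: Acc): Acc =
  Acc(a.ageSum + b.ageSum, a.count + b.count, a.hitSum + b.hitSum)

// With a live SparkContext this would be:
//   keyed.aggregateByKey(Acc(0, 0, 0))(seqOp, combOp)
// Here the same logic runs on plain collections:
val records = Seq(
  ("USA", (24, 224L)), ("USA", (55, 108L)), ("France", (33, 72L)))
val perCountry = records.groupBy(_._1).map { case (country, rows) =>
  val acc = rows.map(_._2).foldLeft(Acc(0, 0, 0))(seqOp)
  country -> (acc.ageSum.toDouble / acc.count, acc.hitSum)
}
// perCountry("USA") == (39.5, 332); perCountry("France") == (33.0, 72)
```

Note that the accumulator stores the age *sum* and *count* rather than the average itself: averages cannot be merged directly, but sums and counts can, which is what makes combOp associative.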
> > > Nick
> > >
> > > View this message in context: Patterns for making multiple
> > > aggregations in one pass
> > > Sent from the Apache Spark User List mailing list archive at Nabble.com.
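For completeness, Nick's reduceByKey() idea can also work in a single pass, provided each record is first mapped to a summable triple (ageSum, count, hitSum) so the reduce function only adds component-wise. The record values and names below are illustrative; the RDD call appears in a comment, and the equivalent logic runs on plain collections.

```scala
// Map each record to (country, (ageSum, count, hitSum)) so that
// reduceByKey only has to add tuples component-wise; the average
// is computed once at the end, after the single pass over the data.
val records = Seq(
  ("USA", 24, 224L), ("USA", 55, 108L), ("France", 33, 72L))

// With an RDD this would be:
//   rdd.map { r => (r.country, (r.age.toLong, 1L, r.hits)) }
//      .reduceByKey { case ((a1, c1, h1), (a2, c2, h2)) =>
//        (a1 + a2, c1 + c2, h1 + h2) }
// Equivalent single-pass logic on plain collections:
val totals = records
  .map { case (country, age, hits) => country -> (age.toLong, 1L, hits) }
  .groupBy(_._1)
  .map { case (country, rows) =>
    country -> rows.map(_._2).reduce { (x, y) =>
      (x._1 + y._1, x._2 + y._2, x._3 + y._3) }
  }
val results = totals.map { case (c, (ageSum, n, hitSum)) =>
  c -> (ageSum.toDouble / n, hitSum) }
// results("USA") == (39.5, 332); results("France") == (33.0, 72)
```

The key design choice is the same as in the aggregateByKey sketch: never carry a running average through the reduce, because averages do not combine associatively; carry sums and a count instead.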