The following is a simplified example of what I am trying to accomplish.

Say I have an RDD of objects like this:

{
    "country": "USA",
    "name": "Franklin",
    "age": 24,
    "hits": 224
}
{
    "country": "USA",
    "name": "Bob",
    "age": 55,
    "hits": 108
}
{
    "country": "France",
    "name": "Remi",
    "age": 33,
    "hits": 72
}

I want to find the average age and total number of hits per country.
Ideally, I would like to scan the data once and perform both aggregations
simultaneously.

What is a good approach to doing this?

I’m thinking that we’d want to keyBy(country), and then somehow
reduceByKey(). The problem is, I don’t know how to approach writing a
function that can be passed to reduceByKey() and that will track a running
average and total simultaneously.
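
To make the idea concrete, here is a minimal sketch of one possible approach, not a confirmed answer. A running average is awkward to merge directly, but a (sum of ages, count, total hits) tuple can be combined pairwise, and the average derived at the end. The helper names (to_pair, combine, finish, reduce_by_key) are hypothetical, and reduce_by_key is a plain-Python stand-in for Spark's reduceByKey() so the example runs without a cluster.

```python
# Sample records from the question above.
records = [
    {"country": "USA", "name": "Franklin", "age": 24, "hits": 224},
    {"country": "USA", "name": "Bob", "age": 55, "hits": 108},
    {"country": "France", "name": "Remi", "age": 33, "hits": 72},
]


def to_pair(rec):
    """Key by country; carry (age_sum, count, hits) as the value."""
    return rec["country"], (rec["age"], 1, rec["hits"])


def combine(a, b):
    """Merge two partial aggregates element-wise (associative and commutative)."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def finish(acc):
    """Turn the accumulated tuple into the final per-country statistics."""
    age_sum, count, hits = acc
    return {"avg_age": age_sum / count, "total_hits": hits}


def reduce_by_key(pairs, func):
    """Local stand-in for reduceByKey(): fold values that share a key."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out


# Single pass over the data: both aggregates are updated together.
totals = reduce_by_key(map(to_pair, records), combine)
result = {country: finish(acc) for country, acc in totals.items()}
print(result)

# With a real RDD, the same functions would presumably be used as:
#   rdd.map(to_pair).reduceByKey(combine).mapValues(finish)
```

The design choice is to keep only quantities that add up cleanly during the reduce step and to compute the average once, after all partial results for a key have been merged.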

Nick

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Patterns-for-making-multiple-aggregations-in-one-pass-tp7874.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
