The following is a simplified example of what I am trying to accomplish.
Say I have an RDD of objects like this:
{
  "country": "USA",
  "name": "Franklin",
  "age": 24,
  "hits": 224
}
{
  "country": "USA",
  "name": "Bob",
  "age": 55,
  "hits": 108
}
{
  "country": "France",
  "name": "Remi",
  "age": 33,
  "hits": 72
}
I want to find the average age and total number of hits per country.
Ideally, I would like to scan the data once and perform both aggregations
simultaneously.
What is a good approach to doing this?
I’m thinking that we’d want to keyBy(country), and then somehow
reduceByKey(). The problem is, I don’t know how to approach writing a
function that can be passed to reduceByKey() and that will track a running
average and total simultaneously.
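
To make the question concrete, here is a rough sketch of the kind of thing I have in mind. It is not a confirmed Spark recipe. The idea is to avoid tracking a running average at all: carry a tuple of (age sum, record count, hits sum) per country, add tuples element-wise, and divide only at the end. The helper names (to_pair, merge, finalize, local_reduce_by_key) are made up for illustration, and local_reduce_by_key is only a plain-Python stand-in for reduceByKey() so the sketch runs without a cluster.

```python
# Sample records from above, as plain Python dicts.
records = [
    {"country": "USA", "name": "Franklin", "age": 24, "hits": 224},
    {"country": "USA", "name": "Bob", "age": 55, "hits": 108},
    {"country": "France", "name": "Remi", "age": 33, "hits": 72},
]


def to_pair(rec):
    """Key by country; the value is (age_sum, count, hits_sum) for one record."""
    return rec["country"], (rec["age"], 1, rec["hits"])


def merge(a, b):
    """Element-wise sum of two partial aggregates (associative and commutative)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def finalize(acc):
    """Turn the summed tuple into the two statistics asked for."""
    age_sum, count, hits_sum = acc
    return {"avg_age": age_sum / count, "total_hits": hits_sum}


def local_reduce_by_key(pairs, func):
    """Hypothetical local stand-in for reduceByKey(): fold values per key."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out


partials = local_reduce_by_key(map(to_pair, records), merge)
summary = {country: finalize(acc) for country, acc in partials.items()}
print(summary)
```

If this is a sensible pattern, I assume the PySpark equivalent would be something like rdd.map(to_pair).reduceByKey(merge).mapValues(finalize), which should read the data only once since both aggregates travel in the same tuple.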
Nick