Ah, this looks like exactly what I need! It seems aggregateByKey() was recently added to PySpark <https://github.com/apache/spark/pull/705/files#diff-6> (and Spark Core), but it's not in the 1.0.0 release.
Thank you.

Nick

On Wed, Jun 18, 2014 at 7:42 PM, Doris Xin <[email protected]> wrote:

> Hi Nick,
>
> Instead of using reduceByKey(), you might want to look into using
> aggregateByKey(), which allows you to return a different value type U
> instead of the input value type V for each input tuple (K, V). You can
> define U to be a datatype that holds both the average and total and have
> seqOp update both fields of U in a single pass.
>
> Hope this makes sense,
> Doris
>
> On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas <[email protected]>
> wrote:
>
>> The following is a simplified example of what I am trying to accomplish.
>>
>> Say I have an RDD of objects like this:
>>
>> {
>>   "country": "USA",
>>   "name": "Franklin",
>>   "age": 24,
>>   "hits": 224
>> }
>> {
>>   "country": "USA",
>>   "name": "Bob",
>>   "age": 55,
>>   "hits": 108
>> }
>> {
>>   "country": "France",
>>   "name": "Remi",
>>   "age": 33,
>>   "hits": 72
>> }
>>
>> I want to find the average age and total number of hits per country.
>> Ideally, I would like to scan the data once and perform both aggregations
>> simultaneously.
>>
>> What is a good approach to doing this?
>>
>> I’m thinking that we’d want to keyBy(country), and then somehow
>> reduceByKey(). The problem is, I don’t know how to approach writing a
>> function that can be passed to reduceByKey() and that will track a
>> running average and total simultaneously.
>>
>> Nick
>>
>> ------------------------------
>> View this message in context: Patterns for making multiple aggregations
>> in one pass
>> <http://apache-spark-user-list.1001560.n3.nabble.com/Patterns-for-making-multiple-aggregations-in-one-pass-tp7874.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
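
For the archive, here is a rough sketch of the seqOp/combOp idea from this thread, written in plain Python so it runs without a cluster. The accumulator layout (age_sum, count, hits_total), the names seq_op, comb_op, finalize, and the helper aggregate_by_key_local are all my own assumptions for illustration, not anything from Spark itself. I also carry a sum and a count rather than a running average, since those merge cleanly across partitions, and compute the average at the end.

```python
# Sample records from the question earlier in this thread.
records = [
    {"country": "USA", "name": "Franklin", "age": 24, "hits": 224},
    {"country": "USA", "name": "Bob", "age": 55, "hits": 108},
    {"country": "France", "name": "Remi", "age": 33, "hits": 72},
]

# Accumulator type U = (age_sum, count, hits_total). Hypothetical layout.
ZERO = (0, 0, 0)


def seq_op(acc, rec):
    """Fold one input record (type V) into an accumulator (type U)."""
    age_sum, count, hits_total = acc
    return (age_sum + rec["age"], count + 1, hits_total + rec["hits"])


def comb_op(a, b):
    """Merge two accumulators that were built on different partitions."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def finalize(acc):
    """Turn the accumulator into the average age and total hits."""
    age_sum, count, hits_total = acc
    return {"avg_age": age_sum / float(count), "total_hits": hits_total}


def aggregate_by_key_local(partitions, zero, seq, comb):
    """Local stand-in (not Spark): fold within each partition, then merge."""
    merged = {}
    for partition in partitions:
        partial = {}
        for key, value in partition:
            partial[key] = seq(partial.get(key, zero), value)
        for key, acc in partial.items():
            merged[key] = comb(merged[key], acc) if key in merged else acc
    return merged


# Two fake "partitions" of (country, record) pairs, to exercise comb_op.
partitions = [
    [(r["country"], r) for r in records[:1]],
    [(r["country"], r) for r in records[1:]],
]

aggregated = aggregate_by_key_local(partitions, ZERO, seq_op, comb_op)
result = {country: finalize(acc) for country, acc in aggregated.items()}

# On a PySpark version that ships aggregateByKey(), the same functions
# should plug in roughly like this (untested here):
#   rdd.keyBy(lambda r: r["country"]) \
#      .aggregateByKey(ZERO, seq_op, comb_op) \
#      .mapValues(finalize)
```

Each record is visited once by seq_op, and comb_op only ever sees partial accumulators, so both aggregations come out of a single pass over the data.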
