Re: Patterns for making multiple aggregations in one pass

Nicholas Chammas Wed, 18 Jun 2014 16:53:27 -0700

Ah, this looks like exactly what I need! aggregateByKey() was recently added
to PySpark <https://github.com/apache/spark/pull/705/files#diff-6> (and
Spark Core), but it's not in the 1.0.0 release.
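
For anyone reading this later, here is a rough sketch of the approach Doris
describes below. It is an illustration, not tested against a Spark cluster.
The accumulator U is assumed to be a tuple (age_sum, record_count,
hits_total), because a running average cannot be merged across partitions
without its count. The names ZERO, seq_op, comb_op, finalize and
aggregate_by_key_local are hypothetical, and aggregate_by_key_local is only
a plain-Python stand-in so the example runs without Spark.

```python
# Accumulator U = (age_sum, record_count, hits_total); an assumed layout.
ZERO = (0, 0, 0)


def seq_op(acc, record):
    """Fold one record (a dict) into the accumulator for its key."""
    age_sum, count, hits = acc
    return (age_sum + record["age"], count + 1, hits + record["hits"])


def comb_op(a, b):
    """Merge two accumulators that were built on different partitions."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def finalize(acc):
    """Turn the accumulator into the two statistics asked for."""
    age_sum, count, hits = acc
    return {"avg_age": float(age_sum) / count, "total_hits": hits}


def aggregate_by_key_local(partitions, zero, seq, comb):
    """Hypothetical local stand-in for aggregateByKey (no Spark needed).

    Applies seq within each partition, then comb across partitions.
    """
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = seq(local.get(key, zero), value)
        for key, acc in local.items():
            merged[key] = comb(merged[key], acc) if key in merged else acc
    return merged


records = [
    {"country": "USA", "name": "Franklin", "age": 24, "hits": 224},
    {"country": "USA", "name": "Bob", "age": 55, "hits": 108},
    {"country": "France", "name": "Remi", "age": 33, "hits": 72},
]

# Two pretend partitions, split so that "USA" spans both and comb_op runs.
partitions = [
    [(r["country"], r) for r in records[:1]],
    [(r["country"], r) for r in records[1:]],
]

stats = {
    country: finalize(acc)
    for country, acc in aggregate_by_key_local(
        partitions, ZERO, seq_op, comb_op
    ).items()
}

# With a PySpark build that includes aggregateByKey (not 1.0.0, per this
# thread), the same functions would be passed along these lines:
#   rdd.keyBy(lambda r: r["country"]) \
#      .aggregateByKey(ZERO, seq_op, comb_op) \
#      .mapValues(finalize)
```

Keeping the sum and count separate and dividing only at the end is what
makes the per-partition results safe to merge.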


Thank you.

Nick


On Wed, Jun 18, 2014 at 7:42 PM, Doris Xin <[email protected]> wrote:

> Hi Nick,
>
> Instead of using reduceByKey(), you might want to look into using
> aggregateByKey(), which allows you to return a different value type U
> instead of the input value type V for each input tuple (K, V). You can
> define U to be a datatype that holds the running age sum, the record
> count, and the hits total, and have seqOp update all of its fields in a
> single pass; the average is then the sum divided by the count.
>
> Hope this makes sense,
> Doris
>
>
> On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas <[email protected]>
> wrote:
>
>> The following is a simplified example of what I am trying to accomplish.
>>
>> Say I have an RDD of objects like this:
>>
>> {
>>     "country": "USA",
>>     "name": "Franklin",
>>     "age": 24,
>>     "hits": 224
>> }
>> {
>>     "country": "USA",
>>     "name": "Bob",
>>     "age": 55,
>>     "hits": 108
>> }
>> {
>>     "country": "France",
>>     "name": "Remi",
>>     "age": 33,
>>     "hits": 72
>> }
>>
>> I want to find the average age and total number of hits per country.
>> Ideally, I would like to scan the data once and perform both aggregations
>> simultaneously.
>>
>> What is a good approach to doing this?
>>
>> I’m thinking that we’d want to keyBy(country), and then somehow
>> reduceByKey(). The problem is, I don’t know how to approach writing a
>> function that can be passed to reduceByKey() and that will track a
>> running average and total simultaneously.
>>
>> Nick
