Thanks. Yes, unfortunately, they all need to be grouped. I guess I can
partition the records by user id. However, I have millions of users; do you
think partitioning by user id will help?

Jianguo

On Mon, Jun 22, 2015 at 6:28 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

>   You’re right of course, I’m sorry. I was typing before thinking about
> what you actually asked!
>
>  On second thought, what is the ultimate goal you need the sequence of
> pages for? Do they all actually need to be grouped? Could you instead
> partition by user id and then use mapPartitions, perhaps?
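>
>  To illustrate, here’s a minimal, untested sketch of that idea (the names
> visits and urlsPerUser are placeholders, not from your code):
>
>  import org.apache.spark.HashPartitioner
>  import org.apache.spark.rdd.RDD
>
>  val visits: RDD[(String, String)] = ... // (user_id, url)
>
>  // Shuffle once so all records for a user land in the same partition.
>  val byUser = visits.partitionBy(new HashPartitioner(visits.partitions.length))
>
>  // Build per-user URL lists locally within each partition. Note that
>  // toSeq still materializes the whole partition in memory.
>  val urlsPerUser: RDD[(String, List[String])] = byUser.mapPartitions { iter =>
>    iter.toSeq.groupBy(_._1).iterator.map { case (user, pairs) =>
>      (user, pairs.map(_._2).toList)
>    }
>  }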
>
>   From: Jianguo Li
> Date: Monday, June 22, 2015 at 6:21 PM
> To: Silvio Fiorito
> Cc: "user@spark.apache.org"
> Subject: Re: workaround for groupByKey
>
>   Thanks for your suggestion. I guess aggregateByKey is similar to
> combineByKey. I read the following in Learning Spark:
>
>  *We can disable map-side aggregation in combineByKey() if we know that
> our data won’t benefit from it. For example, groupByKey() disables map-side
> aggregation as the aggregation function (appending to a list) does not save
> any space. If we want to disable map-side combines, we need to specify the
> partitioner; for now you can just use the partitioner on the source RDD by
> passing rdd.partitioner*
>
>  It seems that when the map-side aggregation function appends to a list
> (as opposed to, say, summing numbers), the map-side aggregation offers no
> benefit, since appending to a list does not save any space. Is my
> understanding correct?
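>
>  If I read that right, disabling the map-side combine would look something
> like this (untested sketch; I picked a HashPartitioner explicitly, since
> input.partitioner in the API is an Option rather than a Partitioner):
>
>  import scala.collection.mutable.ListBuffer
>  import org.apache.spark.HashPartitioner
>
>  val grouped = input.combineByKey(
>    (v: String) => ListBuffer(v),                              // createCombiner
>    (buf: ListBuffer[String], v: String) => buf += v,          // mergeValue
>    (a: ListBuffer[String], b: ListBuffer[String]) => a ++= b, // mergeCombiners
>    new HashPartitioner(input.partitions.length),
>    mapSideCombine = false)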
>
>  Thanks,
>
>  Jianguo
>
> On Mon, Jun 22, 2015 at 4:43 PM, Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
>
>>  You can use aggregateByKey as one option:
>>
>>  import scala.collection.mutable.ListBuffer
>>
>>  val input: RDD[(Int, String)] = ...
>>
>>  val test = input.aggregateByKey(ListBuffer.empty[String])(
>>    (acc, v) => acc += v,   // add a value to the per-partition buffer
>>    (a, b) => a ++= b)      // merge buffers across partitions
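>>
>>  For example, with hypothetical data such as
>>  input = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c"))),
>>  test.collect() would yield something like
>>  Array((1, ListBuffer(a, b)), (2, ListBuffer(c))).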
>>
>>   From: Jianguo Li
>> Date: Monday, June 22, 2015 at 5:12 PM
>> To: "user@spark.apache.org"
>> Subject: workaround for groupByKey
>>
>>   Hi,
>>
>>  I am processing an RDD of key-value pairs. The key is a user_id, and
>> the value is a website URL the user has visited.
>>
>>  Since I need to know all the URLs each user has visited, I am tempted
>> to call groupByKey on this RDD. However, since there could be millions of
>> users and URLs, the shuffle caused by groupByKey proves to be a major
>> bottleneck. Is there a workaround? I want to end up with an RDD of
>> key-value pairs, where the key is a user_id and the value is the list of
>> all URLs visited by that user.
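>>
>>  Concretely, what I have now is something like this (names are
>> placeholders for my actual code):
>>
>>  val visits: RDD[(String, String)] = ... // (user_id, url)
>>
>>  // straightforward, but the shuffle is the bottleneck
>>  val urlsByUser: RDD[(String, List[String])] =
>>    visits.groupByKey().mapValues(_.toList)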
>>
>>  Thanks,
>>
>>  Jianguo
>>
>
>
