Thanks for your suggestion. I guess aggregateByKey is similar to
combineByKey. I read the following in *Learning Spark*:

*We can disable map-side aggregation in combineByKey() if we know that our
data won’t benefit from it. For example, groupByKey() disables map-side
aggregation as the aggregation function (appending to a list) does not save
any space. If we want to disable map-side combines, we need to specify the
partitioner; for now you can just use the partitioner on the source RDD by
passing rdd.partitioner*

It seems that when the combine function appends elements to a list (as
opposed to, say, summing numbers), map-side aggregation offers no benefit:
the combined output is no smaller than the input, so the shuffle still has
to move every element either way. Is my understanding correct?
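If that is right, then I think explicitly disabling the map-side combine
would look roughly like the sketch below. This is just my own illustration
(the toy data, combiner functions, and HashPartitioner choice are all
mine), using the combineByKey overload that takes a partitioner and a
mapSideCombine flag:

import scala.collection.mutable.ListBuffer
import org.apache.spark.HashPartitioner

// toy pair RDD of (user_id, url) -- illustrative data only
val visits = sc.parallelize(Seq((1, "a.com"), (1, "b.com"), (2, "c.com")))

// mapSideCombine = false skips the map-side merge, which (per the book)
// is what groupByKey does, since appending to a list saves no space
val grouped = visits.combineByKey(
  (v: String) => ListBuffer(v),                       // createCombiner
  (buf: ListBuffer[String], v: String) => buf += v,   // mergeValue
  (b1: ListBuffer[String], b2: ListBuffer[String]) => b1 ++= b2, // mergeCombiners
  new HashPartitioner(visits.partitions.length),
  mapSideCombine = false)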

Thanks,

Jianguo

On Mon, Jun 22, 2015 at 4:43 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

>  You can use aggregateByKey as one option:
>
>  val input: RDD[Int, String] = ...
>
>  val test = input.aggregateByKey(ListBuffer.empty[String])((a, b) => a +=
> b, (a, b) => a ++ b)
>
>   From: Jianguo Li
> Date: Monday, June 22, 2015 at 5:12 PM
> To: "user@spark.apache.org"
> Subject: workaround for groupByKey
>
>   Hi,
>
>  I am processing an RDD of key-value pairs. The key is an user_id, and
> the value is an website url the user has ever visited.
>
>  Since I need to know all the urls each user has visited, I am  tempted
> to call the groupByKey on this RDD. However, since there could be millions
> of users and urls, the shuffling caused by groupByKey proves to be a major
> bottleneck to get the job done. Is there any workaround? I want to end up
> with an RDD of key-value pairs, where the key is an user_id, the value is a
> list of all the urls visited by the user.
>
>  Thanks,
>
>  Jianguo
>