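There is reduceByKey, which works on (K, V) pairs and combines the values
for each key on the map side before the shuffle. To use it, you need to
accumulate partial results per key and then merge them. Does your
computation allow that?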
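
To make "accumulate partial results" concrete, here is a rough sketch in
plain Python (no Spark involved). It assumes you only need the distinct
urls per user, so the partial result can be a set. The function names and
the sample data are made up for illustration.

```python
from collections import defaultdict

def combine_partition(pairs):
    """Build a partial result (set of urls per user) inside one partition."""
    partial = defaultdict(set)
    for user_id, url in pairs:
        partial[user_id].add(url)
    return dict(partial)

def merge_partials(partials):
    """Merge the per-partition partial results into one set per user."""
    merged = defaultdict(set)
    for partial in partials:
        for user_id, urls in partial.items():
            merged[user_id] |= urls
    return dict(merged)

# Hypothetical sample data, split into two "partitions".
partitions = [
    [("u1", "a.com"), ("u1", "a.com"), ("u2", "b.com")],
    [("u1", "c.com"), ("u2", "b.com")],
]

# Step 1: combine locally, before anything would be shuffled.
partials = [combine_partition(p) for p in partitions]

# Step 2: merge the partial results across partitions.
result = merge_partials(partials)
```

In PySpark the same idea would look roughly like
rdd.mapValues(lambda u: {u}).reduceByKey(lambda a, b: a | b). Note that
this only cuts the shuffle when urls repeat within a partition. If every
visit has to be kept, the merged lists are as large as what groupByKey
would move.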



On Mon, Jun 22, 2015 at 2:12 PM, Jianguo Li <flyingfromch...@gmail.com>
wrote:

> Hi,
>
> I am processing an RDD of key-value pairs. The key is a user_id, and the
> value is a website url the user has visited.
>
> Since I need to know all the urls each user has visited, I am tempted to
> call groupByKey on this RDD. However, since there could be millions of
> users and urls, the shuffling caused by groupByKey proves to be a major
> bottleneck to getting the job done. Is there any workaround? I want to end
> up with an RDD of key-value pairs, where the key is a user_id and the value
> is a list of all the urls visited by that user.
>
> Thanks,
>
> Jianguo
>



-- 
Deepak
