There is reduceByKey, which works on (K, V) pairs. To use it, you need to accumulate partial results per key and then combine them. Does your computation allow that?
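
To illustrate what "accumulate partial results" means here, below is a rough plain-Python model of the pattern, not Spark code. It assumes you only need the distinct urls per user, so each partition can collapse its records into a set per user_id before anything is merged. The function names and the sample data are made up for illustration. In PySpark the equivalent would be something along the lines of rdd.mapValues(lambda u: {u}).reduceByKey(lambda a, b: a | b).

```python
from collections import defaultdict

def combine_partition(pairs):
    """Map-side step: collapse one partition into {user_id: set_of_urls}."""
    partial = defaultdict(set)
    for user_id, url in pairs:
        partial[user_id].add(url)
    return dict(partial)

def merge_partials(partials):
    """Reduce-side step: union the per-partition sets for each user_id."""
    merged = defaultdict(set)
    for partial in partials:
        for user_id, urls in partial.items():
            merged[user_id] |= urls
    return dict(merged)

# Hypothetical sample data, split into two "partitions".
partitions = [
    [("u1", "a.com"), ("u1", "a.com"), ("u2", "b.com")],
    [("u1", "c.com"), ("u2", "b.com")],
]

# Each partition is reduced locally first; only these partials get merged.
partials = [combine_partition(p) for p in partitions]
result = merge_partials(partials)
```

Note that this only saves shuffle volume when the partial results are smaller than the raw records, for example when users revisit the same urls. If every (user_id, url) pair is unique and you really need the full list, the amount of data moved will be about the same as with groupByKey.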

On Mon, Jun 22, 2015 at 2:12 PM, Jianguo Li <flyingfromch...@gmail.com> wrote:

> Hi,
>
> I am processing an RDD of key-value pairs. The key is an user_id, and the
> value is an website url the user has ever visited.
>
> Since I need to know all the urls each user has visited, I am tempted to
> call the groupByKey on this RDD. However, since there could be millions of
> users and urls, the shuffling caused by groupByKey proves to be a major
> bottleneck to get the job done. Is there any workaround? I want to end up
> with an RDD of key-value pairs, where the key is an user_id, the value is a
> list of all the urls visited by the user.
>
> Thanks,
>
> Jianguo
>

--
Deepak