Thanks for your suggestion. I guess aggregateByKey is similar to
combineByKey. I read the following in Learning Spark:
*We can disable map-side aggregation in combineByKey() if we know that our
data won’t benefit from it. For example, groupByKey() disables map-side
aggregation as the aggregation function (appending to a list) does not save
any space. If we want to disable map-side combines, we need to specify the
partitioner; for now you can just use the partitioner on the source RDD by
passing rdd.partitioner*

It seems that when the map-side aggregation function appends to a list (as
opposed to, say, summing numbers), map-side aggregation offers no benefit,
since appending to a list does not save any space. Is my understanding
correct? (See the sketches below the quoted thread at the end of this
message.)

Thanks,

Jianguo

On Mon, Jun 22, 2015 at 4:43 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> You can use aggregateByKey as one option:
>
> val input: RDD[(Int, String)] = ...
>
> val test = input.aggregateByKey(ListBuffer.empty[String])(
>   (a, b) => a += b,
>   (a, b) => a ++ b)
>
> From: Jianguo Li
> Date: Monday, June 22, 2015 at 5:12 PM
> To: "user@spark.apache.org"
> Subject: workaround for groupByKey
>
> Hi,
>
> I am processing an RDD of key-value pairs. The key is a user_id, and the
> value is a website url the user has visited.
>
> Since I need to know all the urls each user has visited, I am tempted to
> call groupByKey on this RDD. However, since there could be millions of
> users and urls, the shuffle caused by groupByKey proves to be a major
> bottleneck. Is there any workaround? I want to end up with an RDD of
> key-value pairs, where the key is a user_id and the value is a list of
> all the urls visited by that user.
>
> Thanks,
>
> Jianguo
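
For completeness, Silvio's aggregateByKey suggestion as a self-contained
snippet. This is only a sketch: the local SparkContext and the sample
(user_id, url) pairs are assumptions for illustration, not from the thread.

  import scala.collection.mutable.ListBuffer
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(
    new SparkConf().setAppName("urls-by-user").setMaster("local[*]"))

  // Hypothetical sample data: (user_id, url) pairs.
  val input = sc.parallelize(Seq(
    (1, "a.com/home"), (1, "a.com/news"), (2, "b.com")))

  // One url list per user: buffers are built per partition (seqOp),
  // then merged across partitions (combOp).
  val urlsByUser = input.aggregateByKey(ListBuffer.empty[String])(
    (buf, url) => buf += url,  // seqOp: append a value to the local buffer
    (b1, b2) => b1 ++= b2)     // combOp: merge buffers from two partitions

  urlsByUser.collect().foreach(println)
  // e.g. (1,ListBuffer(a.com/home, a.com/news)) and (2,ListBuffer(b.com))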
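
And a sketch of the combineByKey variant with map-side aggregation
disabled, along the lines of the Learning Spark passage quoted at the top.
The HashPartitioner is a stand-in for the book's rdd.partitioner (in Scala
that field is an Option, so it cannot be passed directly); the data is
again hypothetical.

  import scala.collection.mutable.ListBuffer
  import org.apache.spark.HashPartitioner

  // Hypothetical (user_id, url) pairs; `sc` as in the previous sketch.
  val pairs = sc.parallelize(Seq((1, "a.com"), (1, "b.com"), (2, "c.com")))

  val noMapSide = pairs.combineByKey(
    (url: String) => ListBuffer(url),                              // createCombiner
    (buf: ListBuffer[String], url: String) => buf += url,          // mergeValue
    (b1: ListBuffer[String], b2: ListBuffer[String]) => b1 ++= b2, // mergeCombiners
    new HashPartitioner(pairs.partitions.length), // stand-in for rdd.partitioner
    mapSideCombine = false) // list-append saves no space, so skip the map-side pass

With mapSideCombine = false, each url is shuffled as-is and the lists are
built only on the reduce side; since a list of urls is no smaller than the
urls themselves, the map-side pass would have cost memory without shrinking
the shuffle.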