Thanks. Yes, unfortunately, they all need to be grouped. I guess I can partition the records by user id. However, I have millions of users; do you think partitioning by user id will help?
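For concreteness, here is roughly what I have in mind. This is an untested sketch; the RDD name and the partition count are just placeholders:

import org.apache.spark.HashPartitioner

// visits: RDD[(String, String)] of (user_id, url) pairs
// Co-locate all pairs for a given user in one partition, then walk each
// partition once and build the per-user url lists locally.
val partitioned = visits.partitionBy(new HashPartitioner(200))

val urlsByUser = partitioned.mapPartitions { iter =>
  val acc = scala.collection.mutable.Map.empty[String, List[String]]
  iter.foreach { case (user, url) =>
    acc(user) = url :: acc.getOrElse(user, Nil)
  }
  acc.iterator
}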
Jianguo

On Mon, Jun 22, 2015 at 6:28 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> You’re right of course, I’m sorry. I was typing before thinking about
> what you actually asked!
>
> On second thought, what is the ultimate outcome you want the sequence of
> pages for? Do they all actually need to be grouped? Could you instead
> partition by user id and then use mapPartitions, perhaps?
>
> From: Jianguo Li
> Date: Monday, June 22, 2015 at 6:21 PM
> To: Silvio Fiorito
> Cc: "user@spark.apache.org"
> Subject: Re: workaround for groupByKey
>
> Thanks for your suggestion. I guess aggregateByKey is similar to
> combineByKey. I read in Learning Spark:
>
> *We can disable map-side aggregation in combineByKey() if we know that
> our data won’t benefit from it. For example, groupByKey() disables
> map-side aggregation as the aggregation function (appending to a list)
> does not save any space. If we want to disable map-side combines, we need
> to specify the partitioner; for now you can just use the partitioner on
> the source RDD by passing rdd.partitioner*
>
> It seems that when the map-side aggregation function appends to a list
> (as opposed to, say, summing numbers), map-side aggregation offers no
> benefit, since appending to a list does not save any space. Is my
> understanding correct?
>
> Thanks,
>
> Jianguo
>
> On Mon, Jun 22, 2015 at 4:43 PM, Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
>
>> You can use aggregateByKey as one option:
>>
>> import scala.collection.mutable.ListBuffer
>>
>> val input: RDD[(Int, String)] = ...
>>
>> val test = input.aggregateByKey(ListBuffer.empty[String])(
>>   (a, b) => a += b,
>>   (a, b) => a ++ b)
>>
>> From: Jianguo Li
>> Date: Monday, June 22, 2015 at 5:12 PM
>> To: "user@spark.apache.org"
>> Subject: workaround for groupByKey
>>
>> Hi,
>>
>> I am processing an RDD of key-value pairs. The key is a user_id, and
>> the value is a website url the user has visited.
>>
>> Since I need to know all the urls each user has visited, I am tempted
>> to call groupByKey on this RDD. However, since there could be millions
>> of users and urls, the shuffling caused by groupByKey proves to be a
>> major bottleneck in getting the job done. Is there any workaround? I
>> want to end up with an RDD of key-value pairs, where the key is a
>> user_id and the value is a list of all the urls visited by that user.
>>
>> Thanks,
>>
>> Jianguo
>>
>
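P.S. To check my reading of that Learning Spark passage, here is a rough, untested sketch of combineByKey with map-side combines disabled. The RDD name and the partitioner are placeholders; I am not sure this is the intended usage:

import org.apache.spark.HashPartitioner

// pairs: RDD[(String, String)] of (user_id, url), as in my original question
val urlLists = pairs.combineByKey(
  (url: String) => List(url),                        // createCombiner: start a new list
  (urls: List[String], url: String) => url :: urls,  // mergeValue: add within a partition
  (a: List[String], b: List[String]) => a ::: b,     // mergeCombiners: merge across partitions
  new HashPartitioner(pairs.partitions.size),        // placeholder partitioner
  mapSideCombine = false)                            // skip map-side aggregation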