Re: workaround for groupByKey

2015-06-23 Thread Silvio Fiorito
mapPartitions or one of the other combineByKey APIs?

From: Jianguo Li
Date: Tuesday, June 23, 2015 at 9:46 AM
To: Silvio Fiorito
Cc: "user@spark.apache.org"
Subject: Re: workaround for groupByKey

Thanks. Yes, unfortunately, they all need to be grouped…
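
A minimal sketch of the combineByKey route mentioned above, assuming an RDD[(Int, String)] of (user_id, url) pairs as in the original question; the variable names and the choice of ListBuffer are illustrative, not from the thread:

    import scala.collection.mutable.ListBuffer
    import org.apache.spark.rdd.RDD

    val visits: RDD[(Int, String)] = ???   // (user_id, url) pairs

    // Build a ListBuffer of urls per user. The per-key buffers are
    // assembled and merged map-side before the shuffle.
    val urlsPerUser: RDD[(Int, ListBuffer[String])] = visits.combineByKey(
      (url: String) => ListBuffer(url),                              // first value seen for a key
      (buf: ListBuffer[String], url: String) => buf += url,          // add a value to a buffer
      (b1: ListBuffer[String], b2: ListBuffer[String]) => b1 ++= b2  // merge two buffers
    )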

Re: workaround for groupByKey

2015-06-23 Thread Jianguo Li
then use a mapPartitions perhaps?

> From: Jianguo Li
> Date: Monday, June 22, 2015 at 6:21 PM
> To: Silvio Fiorito
> Cc: "user@spark.apache.org"
> Subject: Re: workaround for groupByKey
>
> Thanks for your suggestion. I guess aggregateByKey is similar to
> combineByKey…
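
The thread does not spell out how mapPartitions would be used here; one reading, sketched under the same (user_id, url) assumption (the partitioner and partition count are placeholders, not from the thread): co-locate each user's records with a hash partitioner, then group locally inside each partition instead of calling groupByKey.

    import scala.collection.mutable
    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    val visits: RDD[(Int, String)] = ???   // (user_id, url) pairs

    val urlsPerUser: RDD[(Int, Seq[String])] =
      visits
        .partitionBy(new HashPartitioner(200))   // 200 is arbitrary; all records for a user land in one partition
        .mapPartitions { iter =>
          // Group within the partition using a local map instead of a shuffled Iterable per key.
          val byUser = mutable.HashMap.empty[Int, mutable.ListBuffer[String]]
          iter.foreach { case (user, url) =>
            byUser.getOrElseUpdate(user, mutable.ListBuffer.empty[String]) += url
          }
          byUser.iterator.map { case (user, urls) => (user, urls.toSeq) }
        }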

Re: workaround for groupByKey

2015-06-22 Thread Silvio Fiorito
…then use a mapPartitions perhaps?

From: Jianguo Li
Date: Monday, June 22, 2015 at 6:21 PM
To: Silvio Fiorito
Cc: "user@spark.apache.org"
Subject: Re: workaround for groupByKey

Thanks for your suggestion. I guess aggregateByKey is similar to combineByKey. I read in the Learn…

Re: workaround for groupByKey

2015-06-22 Thread Jianguo Li
You can use aggregateByKey as one option:

> val input: RDD[(Int, String)] = ...
>
> val test = input.aggregateByKey(ListBuffer.empty[String])((a, b) => a += b, (a, b) => a ++ b)
>
> From: Jianguo Li
> Date: Monday, June 22, 2015 at 5:12 PM
> To: "user@spark.apache.org"…
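
A self-contained version of that aggregateByKey snippet, as a sketch: the SparkContext setup, local master, and sample data are made up for illustration, and the input is written as a proper pair RDD.

    import scala.collection.mutable.ListBuffer
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("groupByKey-workaround").setMaster("local[*]"))

    // (user_id, url) pairs
    val input = sc.parallelize(Seq(
      (1, "a.com/x"), (1, "b.com/y"), (2, "c.com/z")
    ))

    // Accumulate each user's urls into a ListBuffer: the first function adds a url
    // to a partition-local buffer, the second merges buffers across partitions.
    val test = input.aggregateByKey(ListBuffer.empty[String])(
      (a, b) => a += b,
      (a, b) => a ++ b
    )

    test.collect().foreach(println)
    // e.g. (1,ListBuffer(a.com/x, b.com/y))
    //      (2,ListBuffer(c.com/z))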

Re: workaround for groupByKey

2015-06-22 Thread ๏̯͡๏
> val input: RDD[(Int, String)] = ...
>
> val test = input.aggregateByKey(ListBuffer.empty[String])((a, b) => a += b, (a, b) => a ++ b)
>
> From: Jianguo Li
> Date: Monday, June 22, 2015 at 5:12 PM
> To: "user@spark.apache.org"
> Subject: workaround for groupByKey

Re: workaround for groupByKey

2015-06-22 Thread Silvio Fiorito
.org>" Subject: workaround for groupByKey Hi, I am processing an RDD of key-value pairs. The key is an user_id, and the value is an website url the user has ever visited. Since I need to know all the urls each user has visited, I am tempted to call the groupByKey on this RDD. However

Re: workaround for groupByKey

2015-06-22 Thread ๏̯͡๏
There is reduceByKey that works on (K, V). You need to accumulate partial results and proceed. Does your computation allow that?

On Mon, Jun 22, 2015 at 2:12 PM, Jianguo Li wrote:
> Hi,
>
> I am processing an RDD of key-value pairs. The key is a user_id, and the
> value is a website URL the user…
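
One way to read that reduceByKey suggestion for this problem (my sketch, not code from the thread): lift each url into a one-element collection first, so that values can be merged pairwise.

    import org.apache.spark.rdd.RDD

    val visits: RDD[(Int, String)] = ???   // (user_id, url) pairs

    // reduceByKey needs the value type to be its own accumulator, so each url
    // is wrapped in a single-element Vector before the pairwise merge.
    val urlsPerUser: RDD[(Int, Vector[String])] =
      visits
        .mapValues(url => Vector(url))
        .reduceByKey(_ ++ _)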

workaround for groupByKey

2015-06-22 Thread Jianguo Li
Hi, I am processing an RDD of key-value pairs. The key is a user_id, and the value is a website URL the user has visited. Since I need to know all the urls each user has visited, I am tempted to call groupByKey on this RDD. However, since there could be millions of users and urls, the…
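
For reference, a minimal sketch of the call the question is weighing (variable names are illustrative): groupByKey shuffles every url and holds all of a user's urls in one in-memory Iterable on a single executor, which is the scaling concern the replies above try to work around.

    import org.apache.spark.rdd.RDD

    val visits: RDD[(Int, String)] = ???   // (user_id, url) pairs

    // Straightforward but memory-hungry at scale: every url for a given user_id
    // ends up in a single Iterable on one executor after the shuffle.
    val urlsPerUser: RDD[(Int, Iterable[String])] = visits.groupByKey()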