Re: workaround for groupByKey

2015-06-23 Thread Silvio Fiorito
mapPartitions or one of the other combineByKey APIs? From: Jianguo Li Date: Tuesday, June 23, 2015 at 9:46 AM To: Silvio Fiorito Cc: "user@spark.apache.org" Subject: Re: workaround for groupByKey Thanks. Yes, unfortunately, they all need to be groupe…
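The mapPartitions idea suggested here can be sketched as grouping values by key inside each partition with a local mutable map, instead of shuffling every individual value the way groupByKey does. `groupWithinPartition` is a hypothetical helper written for a plain `Iterator` so it runs without a Spark cluster; on a real RDD it would be the function passed to `rdd.mapPartitions(...)`.

```scala
import scala.collection.mutable

// Group (key, value) pairs within one partition using a local map.
// On an RDD: rdd.mapPartitions(groupWithinPartition)
def groupWithinPartition[K, V](iter: Iterator[(K, V)]): Iterator[(K, List[V])] = {
  val acc = mutable.LinkedHashMap.empty[K, mutable.ListBuffer[V]]
  iter.foreach { case (k, v) =>
    // Append each value to the buffer for its key, creating it on first sight.
    acc.getOrElseUpdate(k, mutable.ListBuffer.empty[V]) += v
  }
  acc.iterator.map { case (k, buf) => (k, buf.toList) }
}
```

Note this only groups within a single partition; if the same key can appear in several partitions, the data still needs to be partitioned by key first (e.g. with `partitionBy`) for the per-partition groups to be complete.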

Re: workaround for groupByKey

2015-06-23 Thread Jianguo Li
then use a mapPartitions perhaps? > > From: Jianguo Li > Date: Monday, June 22, 2015 at 6:21 PM > To: Silvio Fiorito > Cc: "user@spark.apache.org" > Subject: Re: workaround for groupByKey > > Thanks for your suggestion. I guess aggregateByKey is similar to > combi…

Re: workaround for groupByKey

2015-06-22 Thread Silvio Fiorito
perhaps? From: Jianguo Li Date: Monday, June 22, 2015 at 6:21 PM To: Silvio Fiorito Cc: "user@spark.apache.org" Subject: Re: workaround for groupByKey Thanks for your suggestion. I guess aggregateByKey is similar to combineByKey. I read in the Learn…

Re: workaround for groupByKey

2015-06-22 Thread Jianguo Li
Thanks for your suggestion. I guess aggregateByKey is similar to combineByKey. I read in Learning Spark: "We can disable map-side aggregation in combineByKey() if we know that our data won’t benefit from it. For example, groupByKey() disables map-side aggregation as the aggregation function…"
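The passage quoted above refers to the three functions combineByKey is built from. A minimal local model of those semantics, specialized to collecting values into a list (which is essentially what groupByKey computes); the `combineLocal` driver is a stand-in for illustration, not a Spark API:

```scala
import scala.collection.mutable.ListBuffer

// The three functions combineByKey needs, for list collection:
def createCombiner(v: String): ListBuffer[String] = ListBuffer(v)
def mergeValue(acc: ListBuffer[String], v: String): ListBuffer[String] = acc += v
def mergeCombiners(a: ListBuffer[String], b: ListBuffer[String]): ListBuffer[String] = a ++= b

// Local driver simulating how one partition folds pairs through them.
def combineLocal(pairs: Seq[(String, String)]): Map[String, List[String]] =
  pairs.foldLeft(Map.empty[String, ListBuffer[String]]) { case (m, (k, v)) =>
    m + (k -> m.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
  }.map { case (k, buf) => (k, buf.toList) }
```

The book's point about map-side aggregation: collecting into lists shrinks nothing before the shuffle, so combining on the map side buys no network savings, which is why groupByKey turns it off.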

Re: workaround for groupByKey

2015-06-22 Thread ๏̯͡๏
Silvio, suppose my RDD is (K-1, v1, v2, v3, v4). If I want to do simple addition I can use reduceByKey or aggregateByKey. What if my processing needs to check all the items in the value list each time? The above two operations do not get all the values; they just get two at a time (v1, v2), you do some proc…
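One common answer to the question above (needing every value per key while reduceByKey only hands you two at a time) is to lift each value into a one-element list and reduce by concatenation; on an RDD that would be `rdd.mapValues(List(_)).reduceByKey(_ ++ _)`. A local sketch of the same semantics, with a made-up helper name:

```scala
// Lift each value into a singleton list, then merge lists pairwise.
// reduceByKey only ever sees two accumulators at a time, but because
// concatenation is associative the final list holds every value.
def collectViaReduce[K, V](pairs: Seq[(K, V)]): Map[K, List[V]] =
  pairs.map { case (k, v) => (k, List(v)) }
    .groupBy(_._1)
    .map { case (k, kvs) => (k, kvs.map(_._2).reduce(_ ++ _)) }
```

Caveat: this still moves every value across the network, so it shares groupByKey's shuffle cost; it only changes which API expresses it.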

Re: workaround for groupByKey

2015-06-22 Thread Silvio Fiorito
You can use aggregateByKey as one option: val input: RDD[(Int, String)] = ... val test = input.aggregateByKey(ListBuffer.empty[String])((a, b) => a += b, (a, b) => a ++ b) From: Jianguo Li Date: Monday, June 22, 2015 at 5:12 PM To: "user@spark.apache.org" Subject: wo…
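The snippet above can be exercised without a cluster by modelling aggregateByKey's two levels explicitly: the seqOp folds values into a per-partition accumulator, and the combOp merges accumulators across partitions. The explicit two-partition split below is invented for illustration; on an RDD, Spark decides the partitioning.

```scala
import scala.collection.mutable.ListBuffer

// The two functions from the aggregateByKey call in the thread:
val seqOp  = (acc: ListBuffer[String], v: String) => acc += v
val combOp = (a: ListBuffer[String], b: ListBuffer[String]) => a ++= b

// Simulate aggregateByKey over explicitly listed "partitions".
def aggregateByKeyLocal(partitions: Seq[Seq[(Int, String)]]): Map[Int, List[String]] = {
  // Level 1: fold each partition's values into per-key accumulators (seqOp).
  val partial: Seq[Map[Int, ListBuffer[String]]] = partitions.map { part =>
    part.foldLeft(Map.empty[Int, ListBuffer[String]]) { case (m, (k, v)) =>
      m + (k -> seqOp(m.getOrElse(k, ListBuffer.empty[String]), v))
    }
  }
  // Level 2: merge accumulators for the same key across partitions (combOp).
  partial.flatten.groupBy(_._1).map { case (k, kvs) =>
    (k, kvs.map(_._2).reduce(combOp).toList)
  }
}
```

Because the combine happens per partition first, only the per-key buffers cross partition boundaries in this model, which mirrors how aggregateByKey can pre-combine on the map side.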

Re: workaround for groupByKey

2015-06-22 Thread ๏̯͡๏
There is reduceByKey that works on (K, V). You need to accumulate partial results and proceed. Does your computation allow that? On Mon, Jun 22, 2015 at 2:12 PM, Jianguo Li wrote: > Hi, > > I am processing an RDD of key-value pairs. The key is a user_id, and the > value is a website url the us…
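"Accumulate partial results" here means the merge function must be associative, since reduceByKey combines two values at a time in no guaranteed order. A local model of that contract (`reduceByKeyLocal` is a hypothetical helper, not Spark's API; on an RDD this is just `rdd.reduceByKey(op)`):

```scala
// Pairwise, associative merging per key — the contract reduceByKey assumes.
def reduceByKeyLocal[K](pairs: Seq[(K, Int)], op: (Int, Int) => Int): Map[K, Int] =
  pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduce(op)) }
```

For simple addition this works directly; the follow-up messages in the thread deal with the harder case where the computation needs the whole value list at once.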