mapPartitions or one of the other combineByKey APIs?
From: Jianguo Li
Date: Tuesday, June 23, 2015 at 9:46 AM
To: Silvio Fiorito
Cc: "user@spark.apache.org"
Subject: Re: workaround for groupByKey
Thanks. Yes, unfortunately, they all need to be groupe
then use a mapPartitions perhaps?
>
> From: Jianguo Li
> Date: Monday, June 22, 2015 at 6:21 PM
> To: Silvio Fiorito
> Cc: "user@spark.apache.org"
> Subject: Re: workaround for groupByKey
>
> Thanks for your suggestion. I guess aggregateByKey is similar to
> combi
perhaps?
From: Jianguo Li
Date: Monday, June 22, 2015 at 6:21 PM
To: Silvio Fiorito
Cc: "user@spark.apache.org"
Subject: Re: workaround for groupByKey
Thanks for your suggestion. I guess aggregateByKey is similar to combineByKey.
I read in the Learn
You can use aggregateByKey as one option:
>
> val input: RDD[(Int, String)] = ...
>
> val test = input.aggregateByKey(ListBuffer.empty[String])((a, b) => a += b,
>   (a, b) => a ++ b)
>
> From: Jianguo Li
> Date: Monday, June 22, 2015 at 5:12 PM
> To: "user@spark.apache.org"
> Subject: workaround for groupByKey
>
>
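For reference, the aggregateByKey suggestion quoted above could be fleshed out roughly as follows. This is only a sketch: the object name `UrlsPerUser`, the helper `distinctUrls`, the sample data, and the choice of a mutable `HashSet` (instead of the `ListBuffer` in the quoted snippet) are assumptions made for illustration, and it assumes Spark 1.3 or later so the pair-RDD implicits are in scope without extra imports.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.collection.mutable

// Hypothetical helper object, not part of the original thread.
object UrlsPerUser {
  // Collect the distinct URLs seen for each user_id.
  def distinctUrls(visits: RDD[(Int, String)]): RDD[(Int, mutable.HashSet[String])] =
    visits.aggregateByKey(mutable.HashSet.empty[String])(
      (urls, url) => urls += url, // fold one URL into the partition-local set
      (a, b) => a ++= b           // merge the sets built on different partitions
    )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("urls-per-user").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data: (user_id, url)
      val visits = sc.parallelize(Seq((1, "a.com"), (1, "b.com"), (1, "a.com"), (2, "c.com")))
      distinctUrls(visits).collect().foreach { case (user, urls) =>
        println(s"$user -> ${urls.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Using a set means repeated visits to the same URL are collapsed before the shuffle. If every (user, URL) pair is unique, each value still has to be shuffled, so the saving over groupByKey would be limited.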
.org>"
Subject: workaround for groupByKey
Hi,
I am processing an RDD of key-value pairs. The key is a user_id, and the value
is a website URL the user has visited.
Since I need to know all the URLs each user has visited, I am tempted to call
groupByKey on this RDD. However
There is reduceByKey, which works on (K, V) pairs. You need to accumulate
partial results and proceed. Does your computation allow that?
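As an illustration of a computation that does allow partial results, here is a minimal sketch that counts visits per user with reduceByKey. The per-user visit count is a hypothetical downstream computation chosen for the example, not something stated in the thread; `VisitCounts`, `visitsPerUser`, and the sample data are likewise made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of the original thread.
object VisitCounts {
  // Number of visit records per user_id.
  def visitsPerUser(visits: RDD[(Int, String)]): RDD[(Int, Long)] =
    visits
      .mapValues(_ => 1L) // each record contributes a partial count of 1
      .reduceByKey(_ + _) // partial counts are summed within and across partitions

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("visit-counts").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data: (user_id, url)
      val visits = sc.parallelize(Seq((1, "a.com"), (1, "b.com"), (1, "a.com"), (2, "c.com")))
      visitsPerUser(visits).collect().foreach { case (user, n) => println(s"$user -> $n") }
    } finally {
      sc.stop()
    }
  }
}
```

Because addition is associative and commutative, only one running count per user and partition needs to cross the network. If the computation really needs the full list of URLs per user, this kind of reduction does not apply.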
On Mon, Jun 22, 2015 at 2:12 PM, Jianguo Li
wrote:
> Hi,
>
> I am processing an RDD of key-value pairs. The key is a user_id, and the
> value is a website URL the us
Hi,
I am processing an RDD of key-value pairs. The key is a user_id, and the
value is a website URL the user has visited.
Since I need to know all the URLs each user has visited, I am tempted to
call groupByKey on this RDD. However, since there could be millions of
users and urls, the