> On Jun 26, 2015, at 12:46 AM, Sven Krasser <kras...@gmail.com> wrote:
> 
> In that case the reduceByKey operation will likely not give you any benefit 
> (since you are not aggregating data into smaller values but instead building 
> the same large list you'd build with groupByKey). 

Great, thanks! I overlooked that. I guess it might even be better to use 
groupByKey if the aggregated list is very large for some keys?
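For reference, a minimal local sketch of what groupByKey-style grouping produces (no Spark needed here; in actual PySpark this would be rdd.groupByKey().mapValues(list), and the data below is just the example from this thread):

```python
from collections import defaultdict

# Local stand-in for groupByKey: collect every value for a key into one list.
pairs = [("USA", "California"), ("UK", "Yorkshire"), ("USA", "Colorado")]

grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

print(dict(grouped))
# {'USA': ['California', 'Colorado'], 'UK': ['Yorkshire']}
```

Note that, unlike the reduceByKey one-liner below, this puts every value in a list, including single values such as 'Yorkshire'.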


> If you look at rdd.py, you can see that both operations eventually use a 
> similar operation to do the actual work:
> 
> agg = Aggregator(createCombiner, mergeValue, mergeCombiners)
> 
> Best,
> -Sven
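A rough pure-Python sketch of the combiner pattern both operations build on. The createCombiner/mergeValue/mergeCombiners names follow rdd.py; the explicit "partitions" below are only illustrative, not how Spark actually schedules the work:

```python
# Combiner functions in the style of PySpark's Aggregator.
create_combiner = lambda v: [v]         # first value seen for a key
merge_value = lambda acc, v: acc + [v]  # fold a new value into a combiner
merge_combiners = lambda a, b: a + b    # merge combiners across partitions

# Two illustrative "partitions" of (key, value) pairs.
partitions = [
    [("USA", "California"), ("UK", "Yorkshire")],
    [("USA", "Colorado")],
]

# Per-partition aggregation, then a cross-partition merge.
per_partition = []
for part in partitions:
    acc = {}
    for k, v in part:
        acc[k] = merge_value(acc[k], v) if k in acc else create_combiner(v)
    per_partition.append(acc)

result = {}
for acc in per_partition:
    for k, comb in acc.items():
        result[k] = merge_combiners(result[k], comb) if k in result else comb

print(result)
# {'USA': ['California', 'Colorado'], 'UK': ['Yorkshire']}
```

Since every value ends up in the final list anyway, the combiners never shrink the data, which is why reduceByKey buys nothing over groupByKey for this job.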
> 
> On Thu, Jun 25, 2015 at 4:34 PM, Kannappan Sirchabesan <buildka...@gmail.com 
> <mailto:buildka...@gmail.com>> wrote:
> Thanks. This should work fine. 
> 
> I am trying to avoid groupByKey for performance reasons, as the input is a 
> giant RDD and the operation is associative, so there is minimal shuffle if 
> done via reduceByKey.
> 
>> On Jun 26, 2015, at 12:25 AM, Sven Krasser <kras...@gmail.com 
>> <mailto:kras...@gmail.com>> wrote:
>> 
>> Hey Kannappan,
>> 
>> First of all, what is the reason for avoiding groupByKey since this is 
>> exactly what it is for? If you must use reduceByKey with a one-liner, then 
>> take a look at this:
>> 
>> lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])
>> 
>> In contrast to groupByKey, this won't return 'Yorkshire' as a one-element 
>> list but as a plain string (i.e. in the same way as in your output example).
>> 
>> Hope this helps!
>> -Sven
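To see the one-liner's behavior concretely, here is a local simulation; functools.reduce stands in for Spark's per-key reduction, and the variable names are just illustrative:

```python
from functools import reduce

# The suggested reduceByKey function: wrap any non-list side, then concatenate.
to_list = lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])

usa_values = ["California", "Colorado"]
uk_values = ["Yorkshire"]

print(reduce(to_list, usa_values))  # ['California', 'Colorado']
print(reduce(to_list, uk_values))   # 'Yorkshire' -- a lone value stays a plain string
```

The second call shows the caveat Sven mentions: reduce never invokes the function for a single-element sequence, so keys with exactly one value come back unwrapped rather than as a one-element list.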
>> 
>> On Thu, Jun 25, 2015 at 3:37 PM, Kannappan Sirchabesan <buildka...@gmail.com 
>> <mailto:buildka...@gmail.com>> wrote:
>> Hi,
>>   I am trying to see what is the best way to reduce the values of an RDD of 
>> (key, value) pairs into (key, listOfValues) pairs. I know various ways of 
>> achieving this, but I am looking for an efficient, elegant one-liner if there 
>> is one.
>> 
>> Example:
>> Input RDD: (USA, California), (UK, Yorkshire), (USA, Colorado)
>> Output RDD: (USA, [California, Colorado]), (UK, Yorkshire)
>> 
>> Is it possible to use reduceByKey or foldByKey to achieve this, instead of 
>> groupByKey?
>> 
>> Something equivalent to the cons operator from Lisp, so that I could just say 
>> reduceByKey(lambda x, y: cons(x, y))? Maybe it is more a Python question 
>> than a Spark question: how to create a list from 2 elements without a 
>> starting empty list?
>> 
>> Thanks,
>> Kannappan
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> For additional commands, e-mail: user-h...@spark.apache.org 
>> <mailto:user-h...@spark.apache.org>
>> 
>> 
>> 
>> 
>> -- 
>> www.skrasser.com <http://www.skrasser.com/?utm_source=sig>
> 
> 
> 
> 
> -- 
> www.skrasser.com <http://www.skrasser.com/?utm_source=sig>
