Thanks. This should work fine.

I am trying to avoid groupByKey for performance reasons, as the input is a giant
RDD and the operation is associative, so there is minimal shuffle if done
via reduceByKey.
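For anyone following along, here is a minimal local sketch of the suggested combine function, using functools.reduce and itertools.groupby to mimic what reduceByKey does per key (no SparkContext needed; the sample data is the example from the question below):

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

# The suggested combine function: wrap bare values in lists, then concatenate.
combine = lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])

# Simulate reduceByKey locally: group pairs by key, then fold each
# group's values with the combine function.
pairs = [("USA", "California"), ("UK", "Yorkshire"), ("USA", "Colorado")]
grouped = {
    k: reduce(combine, (v for _, v in g))
    for k, g in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
}
print(grouped)  # {'UK': 'Yorkshire', 'USA': ['California', 'Colorado']}
```

Note that a key with a single value ('UK') keeps its plain string, exactly as Sven describes, since reduce never calls the combine function for a one-element group.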

> On Jun 26, 2015, at 12:25 AM, Sven Krasser <kras...@gmail.com> wrote:
> 
> Hey Kannappan,
> 
> First of all, what is the reason for avoiding groupByKey since this is 
> exactly what it is for? If you must use reduceByKey with a one-liner, then 
> take a look at this:
> 
> lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])
> 
> In contrast to groupByKey, this won't return 'Yorkshire' as a one-element 
> list but as a plain string (i.e., in the same way as in your output example).
> 
> Hope this helps!
> -Sven
> 
> On Thu, Jun 25, 2015 at 3:37 PM, Kannappan Sirchabesan <buildka...@gmail.com 
> <mailto:buildka...@gmail.com>> wrote:
> Hi,
>   I am trying to see what is the best way to reduce the values of an RDD of 
> (key, value) pairs into (key, listOfValues) pairs. I know various ways of 
> achieving this, but I am looking for an efficient, elegant one-liner if there 
> is one.
> 
> Example:
> Input RDD: (USA, California), (UK, Yorkshire), (USA, Colorado)
> Output RDD: (USA, [California, Colorado]), (UK, Yorkshire)
> 
> Is it possible to use reduceByKey or foldByKey to achieve this, instead of 
> groupByKey?
> 
> Something equivalent to the cons operator from Lisp, so that I could just say 
> reduceByKey(lambda x, y: (cons x y)). Maybe it is more a Python question 
> than a Spark question: how to create a list from 2 elements without 
> starting from an empty list?
> 
> Thanks,
> Kannappan
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 
> 
> 
> 
> -- 
> www.skrasser.com <http://www.skrasser.com/?utm_source=sig>
