Thanks. This should work fine. I am trying to avoid groupByKey for performance reasons, as the input is a giant RDD, and the operation is associative, so there is minimal shuffle if it is done via reduceByKey.
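As a quick pure-Python illustration (my own sketch, not from the thread), the list-building combiner Sven suggests below is indeed associative, which is the property reduceByKey relies on to combine values map-side before the shuffle:

```python
# Sven's combiner: wrap non-list values in a list, then concatenate.
combine = lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])

a, b, c = "California", "Colorado", "Texas"
left = combine(combine(a, b), c)   # (a . b) . c
right = combine(a, combine(b, c))  # a . (b . c)
assert left == right == ["California", "Colorado", "Texas"]
```

One caveat: this trick assumes the original values are not themselves lists; a list-valued input would be flattened into its neighbors rather than kept as a single element.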
> On Jun 26, 2015, at 12:25 AM, Sven Krasser <kras...@gmail.com> wrote:
>
> Hey Kannappan,
>
> First of all, what is the reason for avoiding groupByKey, since this is
> exactly what it is for? If you must use reduceByKey with a one-liner, then
> take a look at this:
>
> lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])
>
> In contrast to groupByKey, this won't return 'Yorkshire' as a one-element
> list but as a plain string (i.e. in the same way as in your output example).
>
> Hope this helps!
> -Sven
>
> On Thu, Jun 25, 2015 at 3:37 PM, Kannappan Sirchabesan <buildka...@gmail.com> wrote:
> Hi,
> I am trying to see what is the best way to reduce the values of an RDD of
> (key, value) pairs into (key, listOfValues) pairs. I know various ways of
> achieving this, but I am looking for an efficient, elegant one-liner if
> there is one.
>
> Example:
> Input RDD: (USA, California), (UK, Yorkshire), (USA, Colorado)
> Output RDD: (USA, [California, Colorado]), (UK, Yorkshire)
>
> Is it possible to use reduceByKey or foldByKey to achieve this, instead of
> groupByKey?
>
> Something equivalent to a cons operator from LISP, so that I could just say
> reduceByKey(lambda x, y: (cons x y)). Maybe it is more a Python question
> than a Spark question: how to create a list from 2 elements without a
> starting empty list?
>
> Thanks,
> Kannappan
>
> --
> www.skrasser.com
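Without a Spark cluster at hand, the per-key behavior of reduceByKey with this combiner can be simulated with functools.reduce over each key's values (a plain-Python sketch of the example input from the thread):

```python
from functools import reduce

# The combiner from the thread: wrap non-list values, then concatenate.
combine = lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])

# Example input RDD: (USA, California), (UK, Yorkshire), (USA, Colorado).
# reduceByKey applies the combiner to the values of each key:
usa = reduce(combine, ["California", "Colorado"])
uk = reduce(combine, ["Yorkshire"])

print(usa)  # ['California', 'Colorado']
print(uk)   # 'Yorkshire' -- a lone value stays a plain string, as in the output example
```

Note the asymmetry Sven points out: keys with a single value keep their plain value, whereas groupByKey would yield a one-element iterable for them.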