In that case the reduceByKey operation will likely not give you any benefit, since you are not aggregating the data into smaller values but are instead building the same large lists you'd build with groupByKey. If you look at rdd.py, you can see that both operations eventually funnel through the same machinery to do the actual work:
    agg = Aggregator(createCombiner, mergeValue, mergeCombiners)

Best,
-Sven

On Thu, Jun 25, 2015 at 4:34 PM, Kannappan Sirchabesan <buildka...@gmail.com> wrote:
> Thanks. This should work fine.
>
> I am trying to avoid groupByKey for performance reasons, as the input is a
> giant RDD and the operation is an associative operation, so there is minimal
> shuffle if done via reduceByKey.
>
> On Jun 26, 2015, at 12:25 AM, Sven Krasser <kras...@gmail.com> wrote:
>
> Hey Kannappan,
>
> First of all, what is the reason for avoiding groupByKey, since this is
> exactly what it is for? If you must use reduceByKey with a one-liner, then
> take a look at this:
>
>     lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])
>
> In contrast to groupByKey, this won't return 'Yorkshire' as a one-element
> list but as a plain string (i.e., in the same way as in your output example).
>
> Hope this helps!
> -Sven
>
> On Thu, Jun 25, 2015 at 3:37 PM, Kannappan Sirchabesan <buildka...@gmail.com> wrote:
>
>> Hi,
>>   I am trying to see what is the best way to reduce the values of an RDD
>> of (key, value) pairs into (key, ListOfValues) pairs. I know various ways of
>> achieving this, but I am looking for an efficient, elegant one-liner if
>> there is one.
>>
>> Example:
>> Input RDD: (USA, California), (UK, Yorkshire), (USA, Colorado)
>> Output RDD: (USA, [California, Colorado]), (UK, Yorkshire)
>>
>> Is it possible to use reduceByKey or foldByKey to achieve this, instead
>> of groupByKey?
>>
>> Something equivalent to a cons operator from LISP, so that I could just
>> say reduceByKey(lambda x, y: (cons x y))? Maybe it is more a Python
>> question than a Spark question: how to create a list from 2 elements
>> without a starting empty list?
>>
>> Thanks,
>> Kannappan
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org

--
www.skrasser.com
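[Editor's note: to make the one-liner from the thread concrete without requiring a Spark installation, here is a small sketch that simulates reduceByKey locally with functools.reduce and itertools.groupby. The helper name reduce_by_key is made up for illustration; on a real RDD you would call rdd.reduceByKey(merge) directly with the same lambda.]

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

# The one-liner suggested in the thread: coerce each operand to a list
# only when needed, so a key with a single value keeps a plain string.
merge = lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])

def reduce_by_key(pairs, f):
    """Local stand-in for RDD.reduceByKey: group pairs by key, fold values with f."""
    grouped = groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
    return [(k, reduce(f, (v for _, v in kvs))) for k, kvs in grouped]

pairs = [("USA", "California"), ("UK", "Yorkshire"), ("USA", "Colorado")]
print(reduce_by_key(pairs, merge))
# [('UK', 'Yorkshire'), ('USA', ['California', 'Colorado'])]
```

Note the caveat from the thread still applies: because the merge function builds the same large per-key lists that groupByKey would, this buys no real shuffle savings; it only changes the shape of the singleton results.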