> On Jun 26, 2015, at 12:46 AM, Sven Krasser <kras...@gmail.com> wrote:
>
> In that case the reduceByKey operation will likely not give you any benefit
> (since you are not aggregating data into smaller values but instead building
> the same large list you'd build with groupByKey).

Great, thanks! I overlooked that. I guess it might even be better to use
groupByKey if the aggregated list is very large for some keys?

> If you look at rdd.py, you can see that both operations eventually use a
> similar operation to do the actual work:
>
>     agg = Aggregator(createCombiner, mergeValue, mergeCombiners)
>
> Best,
> -Sven
>
> On Thu, Jun 25, 2015 at 4:34 PM, Kannappan Sirchabesan
> <buildka...@gmail.com> wrote:
>
> Thanks. This should work fine.
>
> I am trying to avoid groupByKey for performance reasons, as the input is a
> giant RDD and the operation is associative, so there is minimal shuffle if
> done via reduceByKey.
>
>> On Jun 26, 2015, at 12:25 AM, Sven Krasser <kras...@gmail.com> wrote:
>>
>> Hey Kannappan,
>>
>> First of all, what is the reason for avoiding groupByKey, since this is
>> exactly what it is for? If you must use reduceByKey with a one-liner, then
>> take a look at this:
>>
>>     lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])
>>
>> In contrast to groupByKey, this won't return 'Yorkshire' as a one-element
>> list but as a plain string (i.e. in the same way as in your output
>> example).
>>
>> Hope this helps!
>> -Sven
>>
>> On Thu, Jun 25, 2015 at 3:37 PM, Kannappan Sirchabesan
>> <buildka...@gmail.com> wrote:
>>
>> Hi,
>>   I am trying to see what is the best way to reduce the values of an RDD
>> of (key, value) pairs into (key, ListOfValues) pairs. I know various ways
>> of achieving this, but I am looking for an efficient, elegant one-liner if
>> there is one.
>>
>> Example:
>> Input RDD: (USA, California), (UK, Yorkshire), (USA, Colorado)
>> Output RDD: (USA, [California, Colorado]), (UK, Yorkshire)
>>
>> Is it possible to use reduceByKey or foldByKey to achieve this, instead of
>> groupByKey? Something equivalent to a cons operator from LISP, so that I
>> could just say reduceByKey(lambda x, y: (cons x y)).
>> Maybe it is more a Python question than a Spark question: how to create a
>> list from 2 elements without a starting empty list?
>>
>> Thanks,
>> Kannappan
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>> --
>> www.skrasser.com
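[Editor's note: Sven's one-liner can be tried out without a Spark cluster. The sketch below is a toy, pure-Python stand-in for `RDD.reduceByKey` (the `reduce_by_key` helper is hypothetical, not Spark's API) that applies his lambda pairwise within each key, showing why single-value keys like 'Yorkshire' come back as plain strings rather than one-element lists.]

```python
from collections import defaultdict
from functools import reduce

# Sven's one-liner: wrap non-list operands in a list, then concatenate.
combine = lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])

def reduce_by_key(pairs, f):
    """Toy stand-in for RDD.reduceByKey: apply f pairwise per key.
    A key with a single value is returned untouched, as in Spark."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: reduce(f, vs) for k, vs in groups.items()}

pairs = [("USA", "California"), ("UK", "Yorkshire"), ("USA", "Colorado")]
result = reduce_by_key(pairs, combine)
# result == {"USA": ["California", "Colorado"], "UK": "Yorkshire"}
```

Note that 'Yorkshire' stays a plain string because `reduce` over a single element never calls the lambda, which matches the output example in the original question.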
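[Editor's note: the `Aggregator(createCombiner, mergeValue, mergeCombiners)` trio Sven points to in rdd.py can also be illustrated in plain Python. The sketch below is an assumption-laden toy model of `combineByKey`, not Spark's implementation: it splits the input into fake "partitions", folds values per partition with `merge_value`, then merges the partial lists with `merge_combiners`. Unlike the one-liner above, this groupByKey-style aggregation yields `['Yorkshire']`, a one-element list.]

```python
# Illustrative stand-ins for the three functions Spark's Aggregator wires up:
def create_combiner(v):      # first value seen for a key starts a fresh list
    return [v]

def merge_value(acc, v):     # fold one more value into a partition-local list
    acc.append(v)
    return acc

def merge_combiners(a, b):   # merge partial lists coming from two partitions
    return a + b

def combine_by_key(pairs, create, merge_v, merge_c, n_partitions=2):
    """Toy combineByKey: aggregate within each 'partition', then merge."""
    per_part = [dict() for _ in range(n_partitions)]
    for i, (k, v) in enumerate(pairs):        # round-robin fake partitioning
        part = per_part[i % n_partitions]
        part[k] = merge_v(part[k], v) if k in part else create(v)
    merged = {}
    for part in per_part:                     # the "shuffle + merge" step
        for k, acc in part.items():
            merged[k] = merge_c(merged[k], acc) if k in merged else acc
    return merged

pairs = [("USA", "California"), ("UK", "Yorkshire"), ("USA", "Colorado")]
grouped = combine_by_key(pairs, create_combiner, merge_value, merge_combiners)
# grouped == {"USA": ["California", "Colorado"], "UK": ["Yorkshire"]}
```

This also makes Sven's performance point concrete: `merge_value` never shrinks the data, so reduceByKey with a list-building function moves just as much data through the shuffle as groupByKey would.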