Thanks for the input. I will give foldByKey a shot. The way I am doing it: the data is partitioned hourly, so I compute distinct values per hour. Then I use UnionRDD to merge them and compute distinct on the overall data.
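Roughly what I have in mind, as a sketch (hourlyRdds and the String key/value types are placeholders for my actual data, not code from this thread):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // brings in the pair-RDD functions
    import org.apache.spark.rdd.RDD

    // One (key, value) RDD per hourly partition, merged with union, then
    // folded per key into a Set so only distinct values stay in memory.
    def distinctValueCounts(sc: SparkContext,
                            hourlyRdds: Seq[RDD[(String, String)]]): RDD[(String, Int)] = {
      sc.union(hourlyRdds)
        .mapValues(Set(_))                     // lift each value into a singleton Set
        .foldByKey(Set.empty[String])(_ ++ _)  // set union; duplicates collapse per key
        .mapValues(_.size)                     // (key, distinct value count)
    }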
Is there a way to know which (key, value) pair is resulting in the OOM?
Is there a way to set parallelism in the map stage so that each worker processes one key at a time?

I didn't realise countApproxDistinctByKey uses HyperLogLogPlus. This should be interesting.

--Vivek

On Sat, Jun 14, 2014 at 11:37 PM, Sean Owen <so...@cloudera.com> wrote:

> Grouping by key is always problematic since a key might have a huge number
> of values. You can do a little better than grouping *all* values and *then*
> finding distinct values by using foldByKey, putting values into a Set. At
> least you end up with only distinct values in memory. (You don't need two
> maps either, right?)
>
> If the number of distinct values is still huge for some keys, consider the
> experimental method countApproxDistinctByKey:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L285
>
> This should be much more performant at the cost of some accuracy.
>
> On Sat, Jun 14, 2014 at 1:58 PM, Vivek YS <vivek...@gmail.com> wrote:
>
>> Hi,
>> For the last couple of days I have been trying hard to get around this
>> problem. Please share any insights on solving it.
>>
>> Problem:
>> There is a huge list of (key, value) pairs. I want to transform this to
>> (key, distinct values) and then eventually to (key, distinct value count).
>>
>> On a small dataset:
>>
>> groupByKey().map(x => (x._1, x._2.distinct)) ... map(x => (x._1,
>> x._2.distinct.count))
>>
>> On a large dataset I am getting an OOM.
>>
>> Is there a way to represent the Seq of values from groupByKey as an RDD
>> and then perform distinct over it?
>>
>> Thanks
>> Vivek
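For anyone following along, a minimal sketch of the approximate approach Sean links to (the pairs name and the 0.01 relative standard deviation are illustrative, not from the thread):

    import org.apache.spark.SparkContext._  // pair-RDD functions
    import org.apache.spark.rdd.RDD

    // countApproxDistinctByKey estimates the distinct-value count per key
    // with HyperLogLog++, so no per-key Set is materialised; the argument
    // is the target relative standard deviation (smaller = more accurate,
    // but more memory per counter).
    def approxDistinctCounts(pairs: RDD[(String, String)]): RDD[(String, Long)] =
      pairs.countApproxDistinctByKey(0.01)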