GroupByKey results in OOM - Any other alternative

2014-06-14 Thread Vivek YS
Hi, For the last couple of days I have been trying hard to get around this problem. Please share any insights on solving it. Problem: There is a huge list of (key, value) pairs. I want to transform this to (key, distinct values) and then eventually to (key, distinct values count). On sma
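The transformation described above, (key, value) pairs to (key, distinct values count), can be sketched without building a per-key list of values, which is the step that tends to exhaust memory with a group-by. The following is a minimal plain-Python illustration, not code from the thread: the function name `distinct_value_counts` and the sample data are hypothetical, and values are assumed to be hashable. On a Spark RDD the same idea would likely correspond to calling `distinct()` on the pairs, mapping each pair to (key, 1), and then using `reduceByKey` to sum the ones.

```python
from collections import Counter


def distinct_value_counts(pairs):
    """Return {key: number of distinct values} for an iterable of (key, value) pairs.

    Hypothetical helper for illustration; assumes keys and values are hashable.
    """
    # Step 1: drop duplicate (key, value) pairs (the role distinct() plays on an RDD).
    unique_pairs = set(pairs)
    # Step 2: count the remaining pairs per key (the role of map to (key, 1) + reduceByKey).
    return dict(Counter(key for key, _ in unique_pairs))


# Hypothetical sample data: key "a" has two distinct values, key "b" has one.
pairs = [("a", 1), ("a", 1), ("a", 2), ("b", 7), ("b", 7)]
counts = distinct_value_counts(pairs)
```

Because duplicates are removed before counting, each key only ever accumulates an integer rather than a collection of its values.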

Re: GroupByKey results in OOM - Any other alternative

2014-06-14 Thread Vivek YS
hould be much more performant at the cost of some accuracy. > > > On Sat, Jun 14, 2014 at 1:58 PM, Vivek YS wrote: > >> Hi, >> For last couple of days I have been trying hard to get around this >> problem. Please share any insights on solving this problem. >>

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Vivek YS
ters. It >>> can also be a problem if you do not have enough disk space, meaning that >>> you have to unpersist at the right points if you are running long flows. >>> >>> For us, even though the disk writes are a performance hit, we prefer the >>> Spark

Re: Broadcst RDD Lookup

2014-05-01 Thread Vivek YS
No, I am sure the items match, because userCluster & productCluster are prepared from "data". The cross product of userCluster & productCluster is a superset of "data". On Thu, May 1, 2014 at 3:41 PM, Mayur Rustagi wrote: > Mostly none of the items in PairRDD match your input. Hence the error. >