Re: PySpark GroupByKey implementation question

2015-07-15 Thread Davies Liu
I think we should start without map-side-combine for Scala, because it's easier to OOM in the JVM than in Python (we don't have a hard limit in Python yet). On Wed, Jul 15, 2015 at 9:52 AM, Matt Cheah wrote: > Should we actually enable map-side-combine for groupByKey in Scala RDD as > well, then? If we i

Re: PySpark GroupByKey implementation question

2015-07-15 Thread Matt Cheah
Should we actually enable map-side-combine for groupByKey in Scala RDD as well, then? If we implement external-group-by, should we implement it with the map-side-combine semantics that PySpark uses? -Matt Cheah On 7/15/15, 8:21 AM, "Davies Liu" wrote: >If the map-side-combine is not that necessa

Re: PySpark GroupByKey implementation question

2015-07-15 Thread Davies Liu
The map-side-combine is not that necessary, given that it cannot reduce the size of the data for shuffling much (we do need to serialize the key for each value), but it can reduce the number of key-value pairs, and potentially reduce the number of operations later (repartition and groupby). On Tu
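
To make the trade-off above concrete, here is a minimal sketch in plain Python (no Spark required). The function name `map_side_combine` and the sample partition are made up for illustration; this is not the code in rdd.py. It only shows that grouping within a partition lowers the number of records handed to the shuffle while every value still has to be shipped.

```python
from collections import defaultdict


def map_side_combine(partition):
    """Group values by key inside one partition, before any shuffle.

    Hypothetical helper for illustration only.
    """
    combined = defaultdict(list)
    for key, value in partition:
        combined[key].append(value)
    return list(combined.items())


# One partition's worth of (key, value) records (made-up data).
partition = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]
combined = map_side_combine(partition)

records_before = len(partition)                    # one record per value
records_after = len(combined)                      # one record per distinct key
values_after = sum(len(vs) for _, vs in combined)  # values are not reduced

print(records_before, records_after, values_after)
```

The record count drops to the number of distinct keys in the partition, but the number of values is unchanged, which is why the saving on shuffled bytes is limited.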

PySpark GroupByKey implementation question

2015-07-14 Thread Matt Cheah
Hi everyone, I was examining the PySpark implementation of groupByKey in rdd.py. I would like to submit a patch improving Scala RDD's groupByKey to give it similar robustness against large groups, as PySpark's implementation has logic to spill part of a single group to disk along the way. Its imp
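
For readers unfamiliar with the idea of spilling part of a single group to disk, the following is a rough sketch of the general technique in plain Python. It is not PySpark's actual implementation: the class name `SpillableGroup`, the `max_in_memory` threshold, and the use of `pickle` with a temporary file are all assumptions chosen for illustration.

```python
import os
import pickle
import tempfile


class SpillableGroup:
    """Holds the values of one key, spilling full buffers to a temp file.

    Hypothetical container for illustration; names and thresholds are assumed.
    """

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self.buffer = []          # values currently held in memory
        self.spill_file = None    # created lazily on the first spill
        self.spilled_chunks = 0   # number of pickled chunks on disk

    def append(self, value):
        self.buffer.append(value)
        if len(self.buffer) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        # Always write at the end, even if the file was read earlier.
        self.spill_file.seek(0, os.SEEK_END)
        pickle.dump(self.buffer, self.spill_file)
        self.spilled_chunks += 1
        self.buffer = []

    def __iter__(self):
        # Replay spilled chunks in the order they were written,
        # then the values still in memory.
        if self.spill_file is not None:
            self.spill_file.seek(0)
            for _ in range(self.spilled_chunks):
                for value in pickle.load(self.spill_file):
                    yield value
        for value in self.buffer:
            yield value


group = SpillableGroup(max_in_memory=3)
for i in range(10):
    group.append(i)
```

Iterating over `group` yields the values in insertion order while never keeping more than `max_in_memory` of them in the in-memory buffer at once. A real implementation would size the threshold by memory usage rather than element count.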