I think we should start without map-side-combine for Scala, because
it's easy to OOM
in JVM than in Python (we don't have hard limit in Python yet).
On Wed, Jul 15, 2015 at 9:52 AM, Matt Cheah wrote:
> Should we actually enable map-side-combine for groupByKey in Scala RDD as
> well, then? If we i
Should we actually enable map-side-combine for groupByKey in Scala RDD as
well, then? If we implement external-group-by should we implement it with
the map-side-combine semantics that Pyspark does?
-Matt Cheah
On 7/15/15, 8:21 AM, "Davies Liu" wrote:
>If the map-side-combine is not that necessa
If the map-side-combine is not that necessary, given the fact that it cannot
reduce the size of data for shuffling much (do need to serialized the key for
each value), but can reduce the number of key-value pairs, and potential reduce
the number of operations later (repartition and groupby).
On Tu
Hi everyone,
I was examining the Pyspark implementation of groupByKey in rdd.py. I would
like to submit a patch improving Scala RDD¹s groupByKey that has a similar
robustness against large groups, as Pyspark¹s implementation has logic to
spill part of a single group to disk along the way.
Its imp