Should we actually enable map-side-combine for groupByKey in Scala RDD as well, then? If we implement external-group-by should we implement it with the map-side-combine semantics that Pyspark does?
-Matt Cheah On 7/15/15, 8:21 AM, "Davies Liu" <dav...@databricks.com> wrote: >If the map-side-combine is not that necessary, given the fact that it >cannot >reduce the size of data for shuffling much (do need to serialized the key >for >each value), but can reduce the number of key-value pairs, and potential >reduce >the number of operations later (repartition and groupby). > >On Tue, Jul 14, 2015 at 7:11 PM, Matt Cheah <mch...@palantir.com> wrote: >> Hi everyone, >> >> I was examining the Pyspark implementation of groupByKey in rdd.py. I >>would >> like to submit a patch improving Scala RDD¹s groupByKey that has a >>similar >> robustness against large groups, as Pyspark¹s implementation has logic >>to >> spill part of a single group to disk along the way. >> >> Its implementation appears to do the following: >> >> Combine and group-by-key per partition locally, potentially spilling >> individual groups to disk >> Shuffle the data explicitly using partitionBy >> After the shuffle, do another local groupByKey to get the final result, >> again potentially spilling individual groups to disk >> >> My question is: what does the explicit map-side-combine step (#1) >> specifically benefit here? I was under the impression that >>map-side-combine >> for groupByKey was not optimal and is turned off in the Scala >>implementation >> Scala PairRDDFunctions.groupByKey calls to combineByKey with >> map-side-combine set to false. Is it something specific to how Pyspark >>can >> potentially spill the individual groups to disk? >> >> Thanks, >> >> -Matt Cheah >> >> P.S. Relevant Links: >> >> >>https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_ji >>ra_browse_SPARK-2D3074&d=BQIFaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oO >>nmz8&r=hzwIMNQ9E99EMYGuqHI0kXhVbvX3nU3OSDadUnJxjAs&m=J7Ee0rvGlgwL83LVX3ZI >>fTWTdAjACOGi3ozEffRaiBo&s=Onqi4oR_J4X2tV5u5NLiSnGdt31rRhHtD8R4KjBCQ9g&e= >> >>https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_sp >>ark_pull_1977&d=BQIFaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8&r=hz >>wIMNQ9E99EMYGuqHI0kXhVbvX3nU3OSDadUnJxjAs&m=J7Ee0rvGlgwL83LVX3ZIfTWTdAjAC >>OGi3ozEffRaiBo&s=weq4Epxezp-hx8AdFlbd4dWSqllNppF5HNhJC1KhTCI&e= >>
smime.p7s
Description: S/MIME cryptographic signature