I believe that these two were indeed originally related. In the old hash-based shuffle, we wrote objects out to disk immediately as they were generated by an RDD's iterator. With the original version of the new sort-based shuffle, by contrast, Spark buffered a batch of objects in memory before writing them out to disk.

My vague memory is that this caused issues for Spark SQL -- I think because SQL got a performance improvement from re-using the same object when generating data from the iterator (but if it re-used objects, the sort-based shuffle didn't work, because all of the buffered entries would incorrectly point to the same underlying object). So the threshold's default was set to 200, matching SQL's default of 200 shuffle partitions, so that SQL wouldn't use the sort-based shuffle. My memory is that the issues around this have since been fixed, but Michael / Reynold / Andrew Or probably have a better memory of this.
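To make the object-reuse problem concrete, here's a minimal standalone Scala sketch -- the MutableRow class is hypothetical, not the actual Spark SQL internals:

  // An iterator that reuses one mutable object for every element,
  // much as SQL's row iterators reportedly did for performance.
  case class MutableRow(var value: Int)

  val row = MutableRow(0)
  val reusing = Iterator.tabulate(3) { i => row.value = i; row }

  // Buffering (as the early sort-based shuffle did) stores three
  // references to the same object, so every entry sees the last write:
  println(reusing.toArray.map(_.value).toList)   // List(2, 2, 2)

  // Copying each element before it is buffered avoids the aliasing:
  val row2 = MutableRow(0)
  val copying = Iterator.tabulate(3) { i => row2.value = i; row2.copy() }
  println(copying.toArray.map(_.value).toList)   // List(0, 1, 2)

The hash-based shuffle never hit this, because each record's current contents were written out before the object was mutated again.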
-Kay

On Wed, Dec 28, 2016 at 7:05 PM, Liang-Chi Hsieh <vii...@gmail.com> wrote:

> This https://github.com/apache/spark/pull/1799 seems to be the first PR
> to introduce this number, but there is no explanation of the number
> there.
>
> Jacek Laskowski wrote
> > Hi,
> >
> > I'm wondering what's so special about 200 to have it as the default
> > value of spark.shuffle.sort.bypassMergeThreshold?
> >
> > Is this an arbitrary number? Is there any theory behind it?
> >
> > Is the number of partitions in Spark SQL, i.e. 200, somehow related to
> > spark.shuffle.sort.bypassMergeThreshold?
> >
> > scala> spark.range(5).groupByKey(_ % 5).count.rdd.getNumPartitions
> > res3: Int = 200
> >
> > I'd appreciate any guidance to get the gist of this seemingly magic
> > number. Thanks!
> >
> > Regards,
> > Jacek Laskowski
> > ----
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
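For anyone who wants to poke at the two settings side by side, here's a minimal sketch; both keys are real Spark configs, and the values shown are just their defaults:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .master("local[*]")
    // Shuffle writes take the bypass-merge path (roughly: one file per
    // reduce partition, no sorting) when the number of reduce partitions
    // is at or below this threshold and no map-side aggregation is needed:
    .config("spark.shuffle.sort.bypassMergeThreshold", "200")
    // SQL's default number of shuffle partitions -- the 200 that
    // getNumPartitions returned above:
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()

  // With both at their defaults, a 200-partition SQL shuffle satisfies
  // numPartitions <= bypassMergeThreshold, so it bypasses the sort.
  println(spark.conf.get("spark.sql.shuffle.partitions"))   // 200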