I believe these two were indeed originally related. In the old hash-based
shuffle, we wrote objects out to disk immediately as they were generated
by an RDD's iterator. With the original version of the new sort-based
shuffle, by contrast, Spark buffered a batch of objects in memory before
writing them out to disk. My vague memory is that this caused issues for
Spark SQL -- I think because SQL got a performance improvement from
re-using the same mutable object when generating rows from the iterator,
but that re-use broke the sort-based shuffle: all of the buffered entries
ended up pointing to the same underlying object. So the default was set
to 200 -- matching Spark SQL's default number of shuffle partitions -- so
that SQL jobs would take the bypass path rather than the sort-based one.
My memory is that the issues around this have since been fixed, but
Michael / Reynold / Andrew Or probably remember the details better than I
do.
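To make the object-reuse problem concrete, here is a minimal,
self-contained Scala sketch -- not Spark's actual shuffle code, and
ReuseSketch / reusingIterator are made-up names -- showing why buffering
references from an iterator that recycles one mutable object corrupts the
buffer, while taking defensive copies does not:

  object ReuseSketch {
    // Hypothetical producer that re-uses one mutable array per record,
    // the way a SQL operator might re-use a single row object for speed.
    def reusingIterator(n: Int): Iterator[Array[Int]] = {
      val shared = new Array[Int](1)
      Iterator.tabulate(n) { i =>
        shared(0) = i
        shared // every call returns the SAME object
      }
    }

    def main(args: Array[String]): Unit = {
      // Buffering references, as the early sort-based shuffle effectively did:
      val buffered = reusingIterator(3).toArray
      println(buffered.map(_(0)).mkString(",")) // prints 2,2,2 -- all entries alias one object

      // Buffering defensive copies: each record is snapshotted first.
      val copied = reusingIterator(3).map(_.clone()).toArray
      println(copied.map(_(0)).mkString(","))   // prints 0,1,2
    }
  }

The hash-style (bypass) path sidesteps this, since each record is written
out before the next one overwrites the shared object.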

-Kay

On Wed, Dec 28, 2016 at 7:05 PM, Liang-Chi Hsieh <vii...@gmail.com> wrote:

>
> This PR, https://github.com/apache/spark/pull/1799, seems to be the
> first one to introduce this number, but it gives no explanation for the
> choice.
>
>
> Jacek Laskowski wrote
> > Hi,
> >
> > I'm wondering what's so special about 200 that it was chosen as the
> > default value of spark.shuffle.sort.bypassMergeThreshold?
> >
> > Is this an arbitrary number? Is there any theory behind it?
> >
> > Is the number of partitions in Spark SQL, i.e. 200, somehow related to
> > spark.shuffle.sort.bypassMergeThreshold?
> >
> > scala> spark.range(5).groupByKey(_ % 5).count.rdd.getNumPartitions
> > res3: Int = 200
> >
> > I'd appreciate any guidance to get the gist of this seemingly magic
> > number. Thanks!
> >
> > Regards,
> > Jacek Laskowski
> > ----
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
> >
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
