Thanks Andrew for a detailed response, So the reason why key value pairs with same keys are always found in a single buckets in Hash based shuffle but not in Sort is because in sort-shuffle each mapper writes a single partitioned file, and it is up to the reducer to fetch correct partitions from the the files ?
On Wed, Aug 19, 2015 at 2:13 AM, Andrew Or <and...@databricks.com> wrote: > Hi Muhammad, > > On a high level, in hash-based shuffle each mapper M writes R shuffle > files, one for each reducer where R is the number of reduce partitions. > This results in M * R shuffle files. Since it is not uncommon for M and R > to be O(1000), this quickly becomes expensive. An optimization with > hash-based shuffle is consolidation, where all mappers run in the same core > C write one file per reducer, resulting in C * R files. This is a strict > improvement, but it is still relatively expensive. > > Instead, in sort-based shuffle each mapper writes a single partitioned > file. This allows a particular reducer to request a specific portion of > each mapper's single output file. In more detail, the mapper first fills up > an internal buffer in memory and continually spills the contents of the > buffer to disk, then finally merges all the spilled files together to form > one final output file. This places much less stress on the file system and > requires much fewer I/O operations especially on the read side. > > -Andrew > > > > 2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed < > 11besemja...@seecs.edu.pk>: > >> I did check it out and although I did get a general understanding of the >> various classes used to implement Sort and Hash shuffles, however these >> slides lack details as to how they are implemented and why sort generally >> has better performance than hash >> >> On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran <ravikiranmag...@gmail.com> >> wrote: >> >>> Have a look at this presentation. >>> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be >>> of help to you. >>> >>> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed < >>> 11besemja...@seecs.edu.pk> wrote: >>> >>>> What are the major differences between how Sort based and Hash based >>>> shuffle operate and what is it that cause Sort Shuffle to perform better >>>> than Hash? >>>> Any talks that discuss both shuffles in detail, how they are >>>> implemented and the performance gains ? >>>> >>> >>> >> >