Yes, in other words, a "bucket" is a single file in hash-based shuffle (no consolidation), but a segment of partitioned file in sort-based shuffle.
2015-08-19 5:52 GMT-07:00 Muhammad Haseeb Javed <11besemja...@seecs.edu.pk>: > Thanks Andrew for a detailed response, > > So the reason why key value pairs with same keys are always found in a > single buckets in Hash based shuffle but not in Sort is because in > sort-shuffle each mapper writes a single partitioned file, and it is up to > the reducer to fetch correct partitions from the the files ? > > On Wed, Aug 19, 2015 at 2:13 AM, Andrew Or <and...@databricks.com> wrote: > >> Hi Muhammad, >> >> On a high level, in hash-based shuffle each mapper M writes R shuffle >> files, one for each reducer where R is the number of reduce partitions. >> This results in M * R shuffle files. Since it is not uncommon for M and R >> to be O(1000), this quickly becomes expensive. An optimization with >> hash-based shuffle is consolidation, where all mappers run in the same core >> C write one file per reducer, resulting in C * R files. This is a strict >> improvement, but it is still relatively expensive. >> >> Instead, in sort-based shuffle each mapper writes a single partitioned >> file. This allows a particular reducer to request a specific portion of >> each mapper's single output file. In more detail, the mapper first fills up >> an internal buffer in memory and continually spills the contents of the >> buffer to disk, then finally merges all the spilled files together to form >> one final output file. This places much less stress on the file system and >> requires much fewer I/O operations especially on the read side. >> >> -Andrew >> >> >> >> 2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed < >> 11besemja...@seecs.edu.pk>: >> >>> I did check it out and although I did get a general understanding of the >>> various classes used to implement Sort and Hash shuffles, however these >>> slides lack details as to how they are implemented and why sort generally >>> has better performance than hash >>> >>> On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran <ravikiranmag...@gmail.com> >>> wrote: >>> >>>> Have a look at this presentation. >>>> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be >>>> of help to you. >>>> >>>> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed < >>>> 11besemja...@seecs.edu.pk> wrote: >>>> >>>>> What are the major differences between how Sort based and Hash based >>>>> shuffle operate and what is it that cause Sort Shuffle to perform better >>>>> than Hash? >>>>> Any talks that discuss both shuffles in detail, how they are >>>>> implemented and the performance gains ? >>>>> >>>> >>>> >>> >> >