Thanks Andrew for a detailed response,

So the reason why key value pairs with same keys are always found in a
single buckets in Hash based shuffle but not in Sort is because in
sort-shuffle each mapper writes a single partitioned file, and it is up to
the reducer to fetch correct partitions from the the files ?

On Wed, Aug 19, 2015 at 2:13 AM, Andrew Or <and...@databricks.com> wrote:

> Hi Muhammad,
>
> On a high level, in hash-based shuffle each mapper M writes R shuffle
> files, one for each reducer where R is the number of reduce partitions.
> This results in M * R shuffle files. Since it is not uncommon for M and R
> to be O(1000), this quickly becomes expensive. An optimization with
> hash-based shuffle is consolidation, where all mappers run in the same core
> C write one file per reducer, resulting in C * R files. This is a strict
> improvement, but it is still relatively expensive.
>
> Instead, in sort-based shuffle each mapper writes a single partitioned
> file. This allows a particular reducer to request a specific portion of
> each mapper's single output file. In more detail, the mapper first fills up
> an internal buffer in memory and continually spills the contents of the
> buffer to disk, then finally merges all the spilled files together to form
> one final output file. This places much less stress on the file system and
> requires much fewer I/O operations especially on the read side.
>
> -Andrew
>
>
>
> 2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed <
> 11besemja...@seecs.edu.pk>:
>
>> I did check it out and although I did get a general understanding of the
>> various classes used to implement Sort and Hash shuffles, however these
>> slides lack details as to how they are implemented and why sort generally
>> has better performance than hash
>>
>> On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran <ravikiranmag...@gmail.com>
>> wrote:
>>
>>> Have a look at this presentation.
>>> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be
>>> of help to you.
>>>
>>> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
>>> 11besemja...@seecs.edu.pk> wrote:
>>>
>>>> What are the major differences between how Sort based and Hash based
>>>> shuffle operate and what is it that cause Sort Shuffle to perform better
>>>> than Hash?
>>>> Any talks that discuss both shuffles in detail, how they are
>>>> implemented and the performance gains ?
>>>>
>>>
>>>
>>
>

Reply via email to