Yes, I think that bug is what I want. Thank you.
So I guess the current reason is that we don't want to buffer up numMapper
incoming streams. So we just iterate through each and transfer it over in
full because that is more network efficient?
I'm not sure I understand why you wouldn't want to so
The relevant JIRA that springs to mind is
https://issues.apache.org/jira/browse/SPARK-2926
If an aggregator and ordering are both defined, then the map side of
sort-based shuffle will sort based on the key ordering so that map-side
spills can be efficiently merged. We do not currently do a sort-b
One thing I have noticed with ExternalSorter is that if an ordering is not
defined, it does the sort using only the partition_id, instead of
(parition_id, hash). This means that on the reduce side you need to pull
the entire dataset into memory before you can begin iterating over the
results.
I f