On Fri, Feb 8, 2019 at 1:48 PM Venkatakrishnan Sowrirajan <vsowr...@asu.edu>
wrote:

> 1. Btw, do you mean skew estimation is done within Spark for Sort? If so,
> can you point me to the class that takes care of it?
>

Spark uses a RangePartitioner to distribute the data for a sort and skew
estimation happens here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L162-L202
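The idea behind that code is roughly: sample keys from the input, sort the sample, and pick split points so each partition receives a similar share of rows. A minimal pure-Python sketch of that idea (simplified: Spark's RangePartitioner uses weighted reservoir sampling and re-samples skewed input partitions, which this toy version skips):

```python
import bisect
import random

def determine_bounds(keys, num_partitions, sample_size=100):
    # Sample the input, sort the sample, and take evenly spaced
    # split points as the partition bounds.
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def partition_of(key, bounds):
    # Binary-search the sorted bounds to find the target partition.
    return bisect.bisect_left(bounds, key)

random.seed(42)
keys = list(range(10_000))
random.shuffle(keys)
bounds = determine_bounds(keys, num_partitions=4)

counts = [0] * 4
for k in keys:
    counts[partition_of(k, bounds)] += 1
# Each of the 4 partitions ends up with roughly 2,500 of the 10,000 keys.
```

Because the bounds come from a sample of the real key distribution, heavily skewed keys still get split into approximately equal ranges, which is what keeps the sort tasks balanced.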


> 2. Wouldn't skew cause failures while sorting? Even though the Sort
> operator spills to disk, wouldn't that still be a problem?
>

No, Spark estimates the distribution's skew so that it can create
approximately equal tasks. Even if tasks are skewed and large, Spark can
spill to disk to avoid failing.
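The spill mechanism can be illustrated with a toy external sort: buffer records in memory, write a sorted run to disk whenever the buffer fills, and merge the runs at the end. This is a simplification of what Spark's ExternalSorter does; the function names here are hypothetical, not Spark's API:

```python
import heapq
import os
import tempfile

def spill(buffer):
    # Write one sorted run to a temp file and return its path.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        for r in sorted(buffer):
            f.write(f"{r}\n")
    return path

def read_run(path):
    # Stream a sorted run back from disk.
    with open(path) as f:
        for line in f:
            yield int(line)

def external_sort(records, max_in_memory):
    # Sort more data than fits in "memory" by spilling sorted runs
    # to disk and merge-sorting the runs at the end.
    runs, buffer = [], []
    for r in records:
        buffer.append(r)
        if len(buffer) >= max_in_memory:
            runs.append(spill(buffer))
            buffer = []
    if buffer:
        runs.append(spill(buffer))
    # heapq.merge lazily merges the already-sorted runs.
    merged = list(heapq.merge(*(read_run(p) for p in runs)))
    for p in runs:
        os.unlink(p)
    return merged

# A task larger than the in-memory budget still completes.
data = [9, 1, 8, 2, 7, 3, 6, 4, 5, 0]
assert external_sort(data, max_in_memory=3) == sorted(data)
```

The point of the sketch: a skewed, oversized task gets slower because it spills, but it does not fail, since memory use is bounded by the buffer size rather than the task size.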


> 3. Last question, couldn't understand the "better wall time" part, how
> sorting would help in better wall time in the write or you meant in the
> read part?
>

You get better wall time by balancing work across tasks instead of having
skewed tasks. Distributing the data evenly lets you actually use your
parallelism: a skewed task cannot be parallelized, so it alone determines
the wall time of its stage.
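A back-of-the-envelope illustration of that point: with enough executors, a stage's wall time is governed by its slowest task, so the same total work finishes sooner when it is split evenly (the row counts and throughput below are hypothetical):

```python
def stage_wall_time(task_sizes, rows_per_second=1_000_000):
    # Tasks run in parallel, so the stage finishes when the
    # largest task does.
    return max(task_sizes) / rows_per_second

# Same 40M rows of total work in both cases.
skewed = [20_000_000, 10_000_000, 5_000_000, 5_000_000]
balanced = [10_000_000] * 4

print(stage_wall_time(skewed))    # 20.0 seconds
print(stage_wall_time(balanced))  # 10.0 seconds
```

The skewed stage takes twice as long even though three of its four tasks finish early and sit idle, which is exactly the cost the range partitioner's skew estimation is trying to avoid.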

-- 
Ryan Blue
Software Engineer
Netflix
