On Fri, Feb 8, 2019 at 1:48 PM Venkatakrishnan Sowrirajan <vsowr...@asu.edu> wrote:
> 1. Btw, you mean skew estimation is done within Spark for sort? If so, can
> you point me to the class which takes care of it?

Spark uses a RangePartitioner to distribute the data for a sort, and skew
estimation happens here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L162-L202

> 2. Wouldn't skew cause a failure while doing the sort? Though the Sort
> operator spills to disk, wouldn't that still be a problem?

No. Spark estimates the distribution's skew so that it can create
approximately equal tasks. Even if tasks are skewed and large, Spark can
spill to disk to avoid failing.

> 3. Last question: I couldn't understand the "better wall time" part. How
> would sorting help wall time on the write side, or did you mean on the
> read side?

You get better wall time by balancing work across tasks instead of having
skewed tasks. That distributes the data better, so you actually use your
parallelism. A skewed task cannot be parallelized and hurts wall time for
the whole stage.

--
Ryan Blue
Software Engineer
Netflix
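The range-partitioning idea described above can be sketched in plain Python. This is a simplified, hypothetical illustration, not Spark's actual Scala implementation (which samples via reservoir sampling and weights the sample; see the linked Partitioner.scala). The function names here are made up for the example:

```python
def range_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys from a sorted sample of keys.

    Simplified sketch of what a RangePartitioner does: boundaries are
    chosen so each range holds roughly the same number of sampled keys,
    so dense (skewed) key regions get narrower ranges and the resulting
    tasks are approximately equal in size.
    """
    s = sorted(sample)
    step = len(s) / num_partitions
    return [s[min(int(i * step), len(s) - 1)] for i in range(1, num_partitions)]

def partition_for(key, bounds):
    """Route a key to the first range whose upper bound it does not exceed."""
    for i, b in enumerate(bounds):
        if key <= b:
            return i
    return len(bounds)  # past the last boundary -> last partition

# With a uniform sample of keys 1..8 and 4 partitions, the boundaries
# split the sample into four equal ranges.
bounds = range_boundaries(list(range(1, 9)), 4)
print(bounds)                    # [3, 5, 7]
print(partition_for(2, bounds))  # 0
print(partition_for(6, bounds))  # 2
```

With a skewed sample, most boundaries would land inside the dense key region, which is how sampling keeps the per-task work balanced even before any data is shuffled.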