Hi All,

    Spark's RDD API pushes sorting into the shuffle layer and delegates
aggregation to it by specifying a key ordering (and an aggregator) as part
of the shuffle dependency. Spark SQL currently does not push sort or
delegate aggregation into the shuffle layer, as SPARK-8317
<https://issues.apache.org/jira/browse/SPARK-8317> mentions. But I think it
may be better to push sort and aggregation into the unsafe shuffle layer:
the key ordering and the partition-id ordering could then be performed
together, which should give better performance. So the key question is
which cases need a key ordering in the map stage. I can list some cases as
follows:

   1. Aggregations whose map-side aggregation is a sort-based aggregation.
   Spark SQL first performs the key ordering and aggregates in a map-side
aggregation operator in the map stage, then performs the partition-id
ordering in the unsafe shuffle layer, so in effect the sort is split into
two phases. If the key ordering were pushed into the shuffle stage, the key
ordering and the partition-id ordering could be performed together.
   2. Sort-merge joins.
   Spark SQL does not sort in the map stage; it only sorts in the reduce
stage. If the sort were pushed into the shuffle stage, the reduce stage
would only need a merge sort, and the map stage would share the sorting
load.
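To make case 1 concrete, here is a minimal sketch in plain Scala (not a
Spark API; `Record` and `combinedOrdering` are illustrative names) of how a
single sort by (partitionId, key) could replace the two separate sorts:

```scala
// Illustrative sketch, not a Spark API: a record tagged with its target
// reduce partition, as the shuffle writer would see it.
case class Record(partitionId: Int, key: Int, value: String)

// Order by partition id first (what the unsafe shuffle writer needs),
// then by key within each partition (what a sort aggregation needs).
val combinedOrdering: Ordering[Record] =
  Ordering.by(r => (r.partitionId, r.key))

val records = Seq(
  Record(1, 5, "e"), Record(0, 3, "c"),
  Record(1, 2, "b"), Record(0, 1, "a"))

// A single sort yields output that is partition-contiguous and
// key-sorted within each partition.
val sorted = records.sorted(combinedOrdering)
// sorted.map(r => (r.partitionId, r.key))
//   == Seq((0, 1), (0, 3), (1, 2), (1, 5))
```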
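And for case 2, a sketch of the reduce-side work that would remain if map
outputs arrived already key-sorted: just a k-way merge of sorted streams
(`mergeSorted` is an illustrative helper, not a Spark API):

```scala
import scala.collection.mutable

// Lazily merge k already-sorted iterators into one sorted iterator,
// which is all a reduce task would need for a sort-merge join.
def mergeSorted[T](streams: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] =
  new Iterator[T] {
    // Heap of (current head element, stream index); Scala's PriorityQueue
    // is a max-heap, so reverse the ordering to pop the smallest head.
    private val heads = mutable.PriorityQueue.empty[(T, Int)](
      Ordering.by[(T, Int), T](_._1)(ord.reverse))
    private val its = streams.toArray
    its.zipWithIndex.foreach { case (it, i) =>
      if (it.hasNext) heads.enqueue((it.next(), i))
    }
    def hasNext: Boolean = heads.nonEmpty
    def next(): T = {
      val (v, i) = heads.dequeue()
      // Refill from the stream the popped element came from.
      if (its(i).hasNext) heads.enqueue((its(i).next(), i))
      v
    }
  }

val merged =
  mergeSorted(Seq(Iterator(1, 4, 7), Iterator(2, 5), Iterator(3, 6))).toSeq
// merged == Seq(1, 2, 3, 4, 5, 6, 7)
```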

  Of course, as SPARK-8317
<https://issues.apache.org/jira/browse/SPARK-8317> mentions, it is
difficult for Spark SQL to choose specialized sorting implementations based
on the types of the rows being sorted if the sort happens in the shuffle
stage. Is there any other reason beyond that for not sorting at the shuffle
stage?


-- 

Regards

John Fang
