Hi All,

Spark RDD pushes sort operations and delegates aggregation to the shuffle layer by specifying a key ordering as part of the shuffle dependency. Spark SQL, however, does not push sort or delegate aggregation into the shuffle layer, as SPARK-8317 <https://issues.apache.org/jira/browse/SPARK-8317> mentions. But I think it may be better to push sort and delegate aggregation into the unsafe shuffle layer: the key ordering and the partition-id ordering could then be performed together, which should give higher performance. So the key question is which cases need a key ordering in the map stage. I can list some of them:
1. Aggregations whose map-side aggregation may be sort-based. Spark SQL first performs the key ordering and aggregates in a map-side aggregation operator in the map stage, then performs the partition-id ordering in the unsafe shuffle layer, so the sort is in fact split into two phases. If the key ordering were pushed into the shuffle stage, the key ordering and the partition-id ordering could be performed together.

2. Sort-merge joins. Spark SQL does not sort in the map stage; it performs the sort only in the reduce stage. If the sort were pushed into the shuffle stage in advance, the reduce stage would only need a merge sort, and the map stage would share the sorting load.

Of course, as SPARK-8317 <https://issues.apache.org/jira/browse/SPARK-8317> mentions, it is difficult for Spark SQL to choose a specialized sorting implementation based on the types of the rows being sorted at the shuffle stage. Is there any other reason beyond that for discarding the sort at the shuffle stage?

--
Regards
John Fang
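P.S. To make case 1 concrete, here is a minimal, self-contained sketch (not Spark code; the hash-style partitioner and the int-key record shape are simplified assumptions of mine) comparing the current two-phase approach -- key ordering in the map stage, then partition-id ordering in the shuffle layer -- with a single combined (partitionId, key) ordering:

```java
import java.util.*;
import java.util.stream.*;

public class CombinedOrderingSketch {
    // Hypothetical hash-partitioner-style partition id for an int key.
    static int partitionId(int key, int numPartitions) {
        return ((key % numPartitions) + numPartitions) % numPartitions;
    }

    // Two-phase: sort by key in the map stage, then stable-sort by
    // partition id in the shuffle layer (the split described above).
    // Records are int[]{key, payload}.
    static List<int[]> twoPhase(List<int[]> records, int numPartitions) {
        List<int[]> out = new ArrayList<>(records);
        out.sort(Comparator.comparingInt(r -> r[0]));          // key ordering (map stage)
        out.sort(Comparator.comparingInt(                      // partition-id ordering
            r -> partitionId(r[0], numPartitions)));           // (shuffle layer, stable)
        return out;
    }

    // One-phase: a single combined (partitionId, key) ordering,
    // i.e. the key ordering pushed into the shuffle layer.
    static List<int[]> onePhase(List<int[]> records, int numPartitions) {
        List<int[]> out = new ArrayList<>(records);
        out.sort(Comparator
            .comparingInt((int[] r) -> partitionId(r[0], numPartitions))
            .thenComparingInt(r -> r[0]));
        return out;
    }

    static String show(List<int[]> rs) {
        return rs.stream().map(r -> r[0] + ":" + r[1]).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<int[]> recs = Arrays.asList(
            new int[]{5, 0}, new int[]{1, 1}, new int[]{3, 2},
            new int[]{1, 3}, new int[]{8, 4});
        // Both orderings agree; the one-phase version needs a single sort pass.
        System.out.println(show(twoPhase(recs, 3)));
        System.out.println(show(onePhase(recs, 3)));
    }
}
```

Because both sorts in the two-phase version are stable, the two approaches yield the same final order, but the combined ordering produces it in one sorting pass, which is exactly the win argued for above.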