Chenyu Zheng created SPARK-51398: ------------------------------------ Summary: SparkSQL supports sorting in the shuffle phase. Key: SPARK-51398 URL: https://issues.apache.org/jira/browse/SPARK-51398 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.5.5 Reporter: Chenyu Zheng
Currently, SparkSQL does not perform any sorting on the shuffle side. This issue hopes to provide an option to enable sorting in the shuffle phase. Why do we need to sort on the shuffle side? In some cloud scenarios, we want to avoid using disk for Spark job. How to avoid using disk for Spark tasks? Some job need to aggregate or sort. If the data is too large to fit in memory, spilling to disk is necessary. If we can move sort to shuffle, then use remote shuffle service, we can avoid disk operations. What are the main changes in this issue? (1) The change only applies to scenarios with shuffle. Therefore, sorting within a window is not supported. (2) Sorting during the shuffle phase. This can avoid introducing SortExec after shuffle and avoid spilling. (4) Provide memory-based HashAggregateExec to avoid spilling. For join, use sort merge join, and the SortExec in reduce phase will be automatically deleted. For aggregation. Use memory-based HashAggregateExec on the map side. Use SortAggregateExec on the reduce side. The SortExec will be automatically deleted between shuffle and SortAggregateExec. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org