[jira] [Created] (SPARK-51398) SparkSQL supports sorting in the shuffle phase.

Chenyu Zheng (Jira) Wed, 05 Mar 2025 01:31:22 -0800

Chenyu Zheng created SPARK-51398:
------------------------------------

             Summary: SparkSQL supports sorting in the shuffle phase.
                 Key: SPARK-51398
                 URL: https://issues.apache.org/jira/browse/SPARK-51398
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.5.5
            Reporter: Chenyu Zheng



Currently, SparkSQL does not perform any sorting on the shuffle side. This 
issue hopes to provide an option to enable sorting in the shuffle phase.

Why do we need to sort on the shuffle side?

In some cloud scenarios, we want to avoid using disk for Spark job.

How to avoid using disk for Spark tasks?

Some job need to aggregate or sort. If the data is too large to fit in memory, 
spilling to disk is necessary. If we can move sort to shuffle, then use remote 
shuffle service, we can avoid disk operations.

What are the main changes in this issue?
(1) The change only applies to scenarios with shuffle. Therefore, sorting 
within a window is not supported.
(2) Sorting during the shuffle phase. This can avoid introducing SortExec after 
shuffle and avoid spilling.
(4) Provide memory-based HashAggregateExec to avoid spilling.

For join, use sort merge join, and the SortExec in reduce phase will be 
automatically deleted.
For aggregation. Use memory-based HashAggregateExec on the map side. Use 
SortAggregateExec on the reduce side. The SortExec will be automatically 
deleted between shuffle and SortAggregateExec.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-51398) SparkSQL supports sorting in the shuffle phase.

Reply via email to