[ 
https://issues.apache.org/jira/browse/SPARK-51398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51398:
-----------------------------------
    Labels: pull-request-available  (was: )

> SparkSQL supports sorting in the shuffle phase.
> -----------------------------------------------
>
>                 Key: SPARK-51398
>                 URL: https://issues.apache.org/jira/browse/SPARK-51398
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.5.5
>            Reporter: Chenyu Zheng
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, SparkSQL does not perform any sorting on the shuffle side. This 
> issue proposes an option to enable sorting in the shuffle phase.
> Why do we need to sort on the shuffle side?
> In some cloud scenarios, we want to avoid using disk for Spark jobs.
> How can Spark tasks avoid using disk?
> Some jobs need to aggregate or sort. If the data is too large to fit in 
> memory, spilling to disk is necessary. If we can move the sort into the 
> shuffle phase and use a remote shuffle service, we can avoid disk operations.
> What are the main changes in this issue?
> (1) The change only applies to scenarios with a shuffle. Therefore, sorting 
> within a window is not supported.
> (2) Sort during the shuffle phase. This avoids introducing a SortExec after 
> the shuffle and avoids spilling.
> (3) Provide a memory-based HashAggregateExec to avoid spilling.
> For joins, use sort merge join; the SortExec in the reduce phase will be 
> removed automatically.
> For aggregation, use the memory-based HashAggregateExec on the map side and 
> SortAggregateExec on the reduce side; the SortExec between the shuffle and 
> the SortAggregateExec will be removed automatically. A sketch of both cases 
> follows below.



