2010YOUY01 opened a new issue, #16131:
URL: https://github.com/apache/datafusion/issues/16131

   ### Is your feature request related to a problem or challenge?
   
   Part of https://github.com/apache/datafusion/issues/16065
   Note: some steps are not related spilling execution, but they're very 
related general optimization, so I also mentioned them here.
   
   ## Background
   In sort queries, when the sort key is multiple columns, a row format can be 
used to accelerate comparison.
   
   See more about the background in:
   
https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/
   https://duckdb.org/pdf/ICDE2023-kuiper-muehleisen-sorting.pdf
   
   ## Current status
   Row format is used in the sort-preserving merge executor to speed up 
comparison among multiple merge inputs, but not used in the sort executor.
   
   Note: probably its better to help get 
https://github.com/apache/datafusion/pull/15380 merged first, then start the 
following steps.
   
   ## Purposed next steps
   1. Use row format by default in the sort executor. This may temporarily slow 
down end-to-end benchmarks, but performance can be recovered through future 
optimizations (see the next point). This change also resolves [issue 
#14748](https://github.com/apache/datafusion/issues/14748).
   
   2. A sort query executed across multiple partitions currently goes through 
two stages of SPM: `Local Sort -> SPM1 -> SPM2`. At each stage, the Arrow Row 
Format array is repeatedly converted without reuse. We should pass the 
converted rows downstream to enable reuse and reduce overhead.
   
   3. For further optimization in workloads with nearly sorted input, the 
converted rows can be used to compute each local sort batch’s min/max. This can 
help concatenate batches without gaps and reduce the degree of merging needed.
   
   4. For further optimization in memory-limited sorts, the converted rows can 
be written to the spill file and reused when reading back, improving efficiency.
   
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to