2010YOUY01 commented on PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2782275431

   This PR and https://github.com/apache/datafusion/pull/15608 both implemented 
multi-level merge for `SortExec` for different purposes:
   ### This PR
   - This PR wants to let memory-limit sort queries be able to run even if the 
memory budget is very tight (i.e. num-spill-files * batch-size > memory limit)
   - Always re-spill for each merge step
   ### #15608
   - Reduces merge degree for performance (reading spills will stall for a 
shorter amount of time)
   - Never re-spill
   
   I think we should refine the existing PR to be:
   1. Prioritize stable execution of memory-limited queries over performance.
       - I think the optimizations mentioned below are somewhat complex. We 
should first resolve the remaining known correctness issues in external sort, 
strengthen the tests, and then proceed with later optimizations more 
confidently.
   2. Extensible for future performance optimization
       - When the memory budget allows, don't always re-spill
       - Consider pre-fetching future spill reads to avoid blocking read
       - For other steps that require merging in `SortExec`, the multi-pass 
merging utility should be reusable for performance: For example if we have 
enough memory to buffer all input buffers, it should be able to do multi-level 
merging.
         (The first two points is related to 
https://github.com/apache/datafusion/issues/15323, the third point is tracked 
in https://github.com/apache/datafusion/issues/7181)
   
   
   To summarize, I think this PR needs to be restructured to make future 
optimizations easier to implement. I don’t have a solid idea yet, so I’ll keep 
thinking and also wait to hear more opinions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to