2010YOUY01 commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2782275431
This PR and https://github.com/apache/datafusion/pull/15608 both implemented multi-level merge for `SortExec` for different purposes: ### This PR - This PR wants to let memory-limit sort queries be able to run even if the memory budget is very tight (i.e. num-spill-files * batch-size > memory limit) - Always re-spill for each merge step ### #15608 - Reduces merge degree for performance (reading spills will stall for a shorter amount of time) - Never re-spill I think we should refine the existing PR to be: 1. Prioritize stable execution of memory-limited queries over performance. - I think the optimizations mentioned below are somewhat complex. We should first resolve the remaining known correctness issues in external sort, strengthen the tests, and then proceed with later optimizations more confidently. 2. Extensible for future performance optimization - When the memory budget allows, don't always re-spill - Consider pre-fetching future spill reads to avoid blocking read - For other steps that require merging in `SortExec`, the multi-pass merging utility should be reusable for performance: For example if we have enough memory to buffer all input buffers, it should be able to do multi-level merging. (The first two points is related to https://github.com/apache/datafusion/issues/15323, the third point is tracked in https://github.com/apache/datafusion/issues/7181) To summarize, I think this PR needs to be restructured to make future optimizations easier to implement. I don’t have a solid idea yet, so I’ll keep thinking and also wait to hear more opinions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org