2010YOUY01 commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2798782989
I didn't implement the parallel merge optimization for now, my major concern is: this optimization requires one extra configuration, and users have to learn and correctly set 2 configs for each individual query, to enable the most efficient cascaded spill merge execution (see the below `intended optimization` section for what's those 2 configs), which is not ideal. So I'd like to defer the implementation for a bit, to think about if there are any simpler approaches (or maybe by collecting stats internally and auto-tune those related configs) Also, I think the current implementation is good enough to cover common cases (I did a rough estimation, sorting TPCH-SF1000 lineitem table with 16GB of memory only requires one round of re-spill) Here is the optimization that I originally thought about, I'll put them into a separate issue if it makes sense. ### Example scenario For one partition's `SortExec`, 100 runs are spilled, and we set `spill_max_spill_merge_degree` to 4 ### Current Implementation  Each time it merges 4 existing spills into one combined spill file, until there are <= 4 spills total, the final result can be produced. For each entry, the number of re-spill will be $floor(\log_4 100)$ = 3 ### Intended optimization If the memory pool is enough to hold more buffers at a time (while `spill_max_spill_merge_degree` is still limited to 4, in case the merge-degree is too large and hurt performance in some cases) One additional config will be introduced `sort_buffer_batch_capacity`, and set to `16` in the above example, the execution will look like:  Then, inside each merge step, 16 spill files will be combined and re-spill. Each entry only need to be re-spilled for $floor(\log_{16} 100)$ = 1 time. With this approach, we can achieve an optimal re-spill count, and also enable parallel merge. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org