rluvaton commented on code in PR #15610: URL: https://github.com/apache/datafusion/pull/15610#discussion_r2041170126
########## datafusion/common/src/config.rs: ##########

```diff
@@ -337,6 +337,13 @@ config_namespace! {
     /// batches and merged.
     pub sort_in_place_threshold_bytes: usize, default = 1024 * 1024

+    /// When doing external sorting, the maximum number of spilled files to
+    /// read back at once. Files read in the same merge step will be
+    /// sort-preserving-merged and re-spilled, and the step will be repeated
+    /// to reduce the number of spilled files in multiple passes, until a
+    /// final sorted run can be produced.
+    pub sort_max_spill_merge_degree: usize, default = 16
```

Review Comment:
   I have a concern about this: there can still be a memory issue if the batches from all the streams together exceed the memory limit.

   I have an implementation for this that is completely memory safe, and I will try to create a PR for it as inspiration.

   The way to decide on the degree is actually to store, for each spill file, the largest amount of memory a single record batch took, and then, when deciding on the degree, simply grow it until you can no longer.

   The reason I'm picky about this is that it is a new configuration option that will be hard to deprecate or change.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
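
The degree-selection idea in the review comment can be sketched as follows. This is a minimal illustration, not DataFusion code: `SpillFile`, `max_batch_bytes`, and `choose_merge_degree` are hypothetical names, and it assumes each spill file recorded the largest in-memory size of any single batch it contains at spill time. The degree is grown greedily until admitting one more file's worst-case batch would exceed the memory budget, with a floor of 2 so the merge can always make progress.

```rust
/// Hypothetical metadata kept per spill file (illustrative, not a DataFusion API).
struct SpillFile {
    /// Largest memory footprint (in bytes) of any single record batch in this
    /// file, recorded when the file was spilled.
    max_batch_bytes: usize,
}

/// Grow the merge degree while the sum of the worst-case batch sizes of the
/// admitted files still fits in the memory budget. Returns at least 2 so a
/// merge step always reduces the number of spill files.
fn choose_merge_degree(files: &[SpillFile], memory_budget: usize) -> usize {
    let mut used = 0usize;
    let mut degree = 0usize;
    for f in files {
        // Once we have a workable degree, stop before exceeding the budget.
        if degree >= 2 && used + f.max_batch_bytes > memory_budget {
            break;
        }
        used += f.max_batch_bytes;
        degree += 1;
    }
    degree.max(2)
}

fn main() {
    let files: Vec<SpillFile> = [400, 300, 200, 500]
        .into_iter()
        .map(|b| SpillFile { max_batch_bytes: b })
        .collect();
    // 400 + 300 + 200 = 900 fits in a 1000-byte budget; adding 500 would not.
    assert_eq!(choose_merge_degree(&files, 1000), 3);
    // Even a tiny budget yields degree 2, so the merge can still progress.
    assert_eq!(choose_merge_degree(&files, 100), 2);
}
```

Unlike a fixed `sort_max_spill_merge_degree`, this adapts the degree per merge step to the actual memory available, which is the commenter's point about avoiding a configuration knob that is hard to change later.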