rluvaton commented on code in PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#discussion_r2041170126


##########
datafusion/common/src/config.rs:
##########
@@ -337,6 +337,13 @@ config_namespace! {
         /// batches and merged.
         pub sort_in_place_threshold_bytes: usize, default = 1024 * 1024
 
+        /// When doing external sorting, the maximum number of spilled files to
+        /// read back at once. The files read in the same merge step will be
+        /// sort-preserving-merged and re-spilled, and the step will be repeated
+        /// to reduce the number of spilled files in multiple passes, until a
+        /// final sorted run can be produced.
+        pub sort_max_spill_merge_degree: usize, default = 16
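The multi-pass behaviour the doc comment describes can be sketched as follows; `merge_passes` is a hypothetical helper for illustration only, not part of the PR:

```rust
// Number of sort-preserving-merge passes needed to reduce `num_files`
// spilled files to a single sorted run, merging at most `degree` files
// per step (hypothetical helper illustrating the doc comment above).
fn merge_passes(mut num_files: usize, degree: usize) -> usize {
    assert!(degree >= 2, "merge degree must be at least 2");
    let mut passes = 0;
    while num_files > 1 {
        // Each pass merges groups of up to `degree` files into one file each.
        num_files = num_files.div_ceil(degree);
        passes += 1;
    }
    passes
}

fn main() {
    // With the default degree of 16, 100 spilled files need two passes:
    // 100 -> 7 -> 1.
    assert_eq!(merge_passes(100, 16), 2);
    assert_eq!(merge_passes(16, 16), 1);
    assert_eq!(merge_passes(257, 16), 3); // 257 -> 17 -> 2 -> 1
}
```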

Review Comment:
   I have a concern about this: there can still be a memory issue if the 
batches from all the merged streams together exceed the memory limit.
   
   I have an implementation of this that is completely memory safe, and I 
will try to create a PR with it for inspiration.
   
   The way to decide on the degree is to store, for each spill file, the 
largest amount of memory a single record batch has taken, and then, when 
deciding on the degree, simply grow it until you can no longer stay within 
the memory limit.
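The selection scheme described above could look roughly like the following sketch; the names are hypothetical and not from the DataFusion codebase:

```rust
// Sketch of the memory-safe degree selection described in the comment.
// `max_batch_bytes[i]` is the largest memory a single record batch took
// in spill file `i`. The degree grows while the worst-case cost of
// holding one batch per merged stream still fits in the budget.
fn choose_merge_degree(max_batch_bytes: &[usize], memory_budget: usize) -> usize {
    let mut total = 0usize;
    let mut degree = 0usize;
    for &bytes in max_batch_bytes {
        match total.checked_add(bytes) {
            Some(sum) if sum <= memory_budget => {
                total = sum;
                degree += 1;
            }
            _ => break,
        }
    }
    // A real implementation must still handle `degree < 2` (a merge needs
    // at least two input streams to make progress), e.g. by erroring out
    // or re-spilling with smaller batches.
    degree
}

fn main() {
    // Budget of 10 MiB; files whose largest batches are 4, 3, and 5 MiB.
    let max_batches = [4usize << 20, 3 << 20, 5 << 20];
    // 4 + 3 fits; adding 5 would exceed 10 MiB, so the degree is 2.
    assert_eq!(choose_merge_degree(&max_batches, 10 << 20), 2);
    // With a large enough budget, all files can be merged in one step.
    assert_eq!(choose_merge_degree(&max_batches, 16 << 20), 3);
}
```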



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

