2010YOUY01 commented on code in PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#discussion_r2041368520


##########
datafusion/common/src/config.rs:
##########
@@ -337,6 +337,13 @@ config_namespace! {
         /// batches and merged.
         pub sort_in_place_threshold_bytes: usize, default = 1024 * 1024
 
+        /// When doing external sorting, the maximum number of spilled files to
+        /// read back at once. Those read files in the same merge step will be sort-
+        /// preserving-merged and re-spilled, and the step will be repeated to reduce
+        /// the number of spilled files in multiple passes, until a final sorted run
+        /// can be produced.
+        pub sort_max_spill_merge_degree: usize, default = 16

Review Comment:
   > The reason why I'm picky about this is that it is a new configuration that will be hard to deprecate or change
   
   This is a solid point. This option is intended to be set manually, and whoever sets it has to ensure `(max_batch_size * per_partition_merge_degree * partition_count) < total_memory_limit`. If it is set correctly for a query, the query should succeed.
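
As a rough sketch of that constraint (not code from this PR): assuming `max_batch_size` is measured in bytes and every partition merges at the same degree, the check and the largest degree that satisfies it could look like the following. The function names are hypothetical.

```rust
/// Returns true if
/// `max_batch_size * merge_degree * partition_count < total_memory_limit`.
/// Assumption: `max_batch_size` is the size of one batch in bytes.
fn merge_degree_fits(
    max_batch_size: usize,
    merge_degree: usize,
    partition_count: usize,
    total_memory_limit: usize,
) -> bool {
    // Checked arithmetic: an overflowing product cannot fit in any limit.
    max_batch_size
        .checked_mul(merge_degree)
        .and_then(|bytes| bytes.checked_mul(partition_count))
        .map_or(false, |needed| needed < total_memory_limit)
}

/// Largest merge degree that satisfies the constraint above, or `None`
/// if not even a 2-way merge fits (or an input is zero).
fn max_merge_degree(
    max_batch_size: usize,
    partition_count: usize,
    total_memory_limit: usize,
) -> Option<usize> {
    let bytes_per_degree = max_batch_size.checked_mul(partition_count)?;
    if bytes_per_degree == 0 {
        return None;
    }
    // Strict inequality: degree * bytes_per_degree <= total_memory_limit - 1.
    let degree = total_memory_limit.checked_sub(1)? / bytes_per_degree;
    if degree >= 2 {
        Some(degree)
    } else {
        None
    }
}
```

For example, with 8 MiB batches, 4 partitions, and a 1 GiB limit, `max_merge_degree` returns `Some(31)`, so the proposed default of 16 would fit under these assumptions.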
   The problem is the ever-growing number of configurations in DataFusion; it seems impossible to set them all correctly. Enabling the parallel merging optimization would require introducing yet another configuration, which I'm also trying to avoid (though the too-many-configs problem might be a harsh reality we must accept).
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

