ShashidharM0118 commented on code in PR #19073:
URL: https://github.com/apache/datafusion/pull/19073#discussion_r2626576791
##########
datafusion/core/src/physical_planner.rs:
##########
@@ -1599,6 +1603,25 @@ impl DefaultPhysicalPlanner {
}
}
+fn has_sufficient_rows_for_repartition(
+    input: &Arc<dyn ExecutionPlan>,
+    session_state: &SessionState,
+) -> Result<bool> {
+    // Get partition statistics, default to repartitioning if unavailable
+    let stats = match input.partition_statistics(None) {
+        Ok(s) => s,
+        Err(_) => return Ok(true),
+    };
+
+    if let Some(num_rows) = stats.num_rows.get_value().copied() {
Review Comment:
Hi @alamb, thanks for the suggestion to use
`datafusion.optimizer.repartition_file_min_size`; it makes more sense to me,
so I've switched the heuristic over to be size-based.
I have two concerns:
1. I've added a follow-up `num_rows` check that is used only when
`total_byte_size` is missing: the code first applies the size-based threshold,
and only if no size statistics are available does it fall back to
`num_rows >= batch_size` (see the sketch after this list). Does this fallback
approach look reasonable?
2. The `hash_agg_group_by_partitioned_on_dicts` test in
`physical_planner.rs` (lines 3275–3304) was originally written to assert that a
partitioned aggregate is produced on dictionary keys. With the new size-based
heuristic, the small in-memory dict dataset no longer triggers repartitioning,
so the test now asserts `mode: Single`, which no longer matches the test name
or the original intent. How would you prefer to handle this test?
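
For reference, here is a minimal sketch of the fallback ordering described in
point 1. It is not the exact PR code: the function name is illustrative, and
the two thresholds are passed as plain parameters instead of being read from
`SessionState`, just to keep the snippet self-contained.

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::physical_plan::ExecutionPlan;

// Illustrative name and signature; the real function reads the thresholds
// from the session configuration.
fn should_repartition_by_size(
    input: &Arc<dyn ExecutionPlan>,
    repartition_file_min_size: usize,
    batch_size: usize,
) -> Result<bool> {
    // If statistics cannot be computed, default to repartitioning
    // (the previous behavior).
    let stats = match input.partition_statistics(None) {
        Ok(s) => s,
        Err(_) => return Ok(true),
    };

    // Primary check: total byte size vs. repartition_file_min_size.
    if let Some(total_bytes) = stats.total_byte_size.get_value().copied() {
        return Ok(total_bytes >= repartition_file_min_size);
    }

    // Fallback: row count vs. batch_size, used only when no size
    // statistics are available.
    if let Some(num_rows) = stats.num_rows.get_value().copied() {
        return Ok(num_rows >= batch_size);
    }

    // No usable statistics at all: repartition.
    Ok(true)
}
```

If that ordering (size first, row count only as a fallback) is not what you had
in mind, I'm happy to adjust it.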
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]