ShashidharM0118 commented on code in PR #19073:
URL: https://github.com/apache/datafusion/pull/19073#discussion_r2626576791
##########
datafusion/core/src/physical_planner.rs:
##########
@@ -1599,6 +1603,25 @@ impl DefaultPhysicalPlanner {
}
}
+fn has_sufficient_rows_for_repartition(
+    input: &Arc<dyn ExecutionPlan>,
+    session_state: &SessionState,
+) -> Result<bool> {
+    // Get partition statistics, default to repartitioning if unavailable
+    let stats = match input.partition_statistics(None) {
+        Ok(s) => s,
+        Err(_) => return Ok(true),
+    };
+
+    if let Some(num_rows) = stats.num_rows.get_value().copied() {
Review Comment:
Hi @alamb, thanks for the suggestion to use
`datafusion.optimizer.repartition_file_min_size`; it makes more sense to me,
so I've switched the heuristic over to be size-based.
I have two concerns:
1. I've added a follow-up `num_rows` check that is used only when
`total_byte_size` is missing: the code first applies the size-based threshold,
and only if no size statistics are available does it fall back to
`num_rows >= batch_size` (see the sketch after this list). Does this fallback
approach look reasonable?
2. The `hash_agg_group_by_partitioned_on_dicts` test in
`physical_planner.rs` (lines 3275–3304) was originally written to assert that a
partitioned aggregate is produced on dictionary keys. With the new size-based
heuristic, the small in-memory dict dataset no longer triggers repartitioning,
so the test now asserts `mode: Single`, which no longer matches the test name
or the original intent. How would you prefer to handle this test?
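
For reference, here is a minimal sketch of the fallback ordering described in
point 1. It is not the exact PR code: the function name is illustrative, and
the two thresholds are passed as plain parameters instead of being read from
`SessionState`, just to keep the snippet self-contained.

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::physical_plan::ExecutionPlan;

// Illustrative name and signature; the real function reads the thresholds
// from the session configuration.
fn should_repartition_by_size(
    input: &Arc<dyn ExecutionPlan>,
    repartition_file_min_size: usize,
    batch_size: usize,
) -> Result<bool> {
    // If statistics cannot be computed, default to repartitioning
    // (the previous behavior).
    let stats = match input.partition_statistics(None) {
        Ok(s) => s,
        Err(_) => return Ok(true),
    };

    // Primary check: total byte size vs. repartition_file_min_size.
    if let Some(total_bytes) = stats.total_byte_size.get_value().copied() {
        return Ok(total_bytes >= repartition_file_min_size);
    }

    // Fallback: row count vs. batch_size, used only when no size
    // statistics are available.
    if let Some(num_rows) = stats.num_rows.get_value().copied() {
        return Ok(num_rows >= batch_size);
    }

    // No usable statistics at all: repartition.
    Ok(true)
}
```

If that ordering (size first, row count only as a fallback) is not what you had
in mind, I'm happy to adjust it.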
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]