Re: [I] Redundant Repartition: `RoundRobinBatch` Followed by `Hash` in Physical Plans [datafusion]

via GitHub Sun, 06 Apr 2025 23:02:44 -0700


UBarney commented on issue #15601:
URL: https://github.com/apache/datafusion/issues/15601#issuecomment-2782098214


   > `tpch_sf1` And `tpch_sf10` by default already partition the input data, so 
AFAIK the plans should not be any different (they don't introduce round-robin 
repartition)
   
   That's not entirely accurate. At least for `tpch_sf1`, the plan includes a 
pattern like `hash_partition -> round_robin`, as seen in 
[q2_run_with_sf1](https://gist.github.com/UBarney/1fa47ed8bf043a88b646c1bc466994cb#file-gistfile1-txt-L255-L275).
   
   In contrast, `tpch_sf10` may avoid inserting a `RoundRobinBatch` followed by 
a `HashPartitioning`.
   
   
   
   I added this line:
   ```patch
   --- a/datafusion/physical-optimizer/src/enforce_distribution.rs
   +++ b/datafusion/physical-optimizer/src/enforce_distribution.rs
   @@ -1262,6 +1262,7 @@ pub fn ensure_distribution(
                            // Add round-robin repartitioning on top of the operator
                            // to increase parallelism.
                            child = add_roundrobin_on_top(child, target_partitions)?;
   +                        println!("add rr");
                        }
                        // When inserting hash is necessary to satisfy hash requirement, insert hash repartition.
                        if hash_necessary {
   ```
   
   
[tpch_sf1_output](https://gist.githubusercontent.com/UBarney/1892acbec803e9b09230b6679524ddb3/raw/f62713979ee6ef1d37db1217d01a6cc108d804b4/gistfile1.txt)
 has some `add rr` lines in its output.
   
[tpch_sf10_output](https://gist.githubusercontent.com/UBarney/f97a0f30809fb10d484939f54f474926/raw/e94a3bba78012bf66c2e519a0b4ee222c0612926/gistfile1.txt)
 has no `add rr` lines in its output.
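
As a complement to the `println!` probe, the pattern can also be looked for directly in the printed plan text. The following is only a minimal sketch: it assumes the plan is rendered one operator per line, parent above child, with repartitions shown as `RepartitionExec: partitioning=...`. The names `is_repartition`, `count_hash_over_round_robin` and `SAMPLE_PLAN` are hypothetical, and the sample plan is made up for illustration rather than copied from the gists.

```rust
/// Returns true if a plan line is a `RepartitionExec` using the given scheme.
/// Assumes the textual form `RepartitionExec: partitioning=<scheme>(...)`.
fn is_repartition(line: &str, scheme: &str) -> bool {
    let trimmed = line.trim_start();
    trimmed.starts_with("RepartitionExec:")
        && trimmed.contains(&format!("partitioning={scheme}"))
}

/// Counts places where a Hash repartition line is immediately followed by a
/// RoundRobinBatch repartition line, i.e. the round-robin output feeds the hash.
pub fn count_hash_over_round_robin(plan: &str) -> usize {
    let lines: Vec<&str> = plan.lines().collect();
    lines
        .windows(2)
        .filter(|pair| {
            is_repartition(pair[0], "Hash") && is_repartition(pair[1], "RoundRobinBatch")
        })
        .count()
}

/// Illustrative plan text (hypothetical, not taken from the linked outputs).
pub const SAMPLE_PLAN: &str = "\
AggregateExec: mode=FinalPartitioned, gby=[a@0 as a], aggr=[]
  RepartitionExec: partitioning=Hash([a@0], 4), input_partitions=4
    RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
      AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[]
        DataSourceExec: file_groups={1 group: [[data.parquet]]}
";
```

Calling `count_hash_over_round_robin` on the plan text of each query would give a per-query count to compare between `tpch_sf1` and `tpch_sf10`. It only matches directly adjacent lines, so it would miss cases where another operator sits between the two repartitions.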
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

