Re: [PR] fix: synchronize partition bounds reporting in HashJoin [datafusion]

via GitHub Tue, 09 Sep 2025 11:33:27 -0700


adriangb commented on PR #17452:
URL: https://github.com/apache/datafusion/pull/17452#issuecomment-3271843831


   > > > @adriangb I think there is opportunity to simplify the bounds
   > > > collection for each partition. That is, we can probably just track
   > > > the min/max across all partitions and build a single `AND` binary
   > > > expr once we have the final min/max (i.e. all partition bounds have
   > > > been reported).
   > > > Aside from one less mutex, I think it'll help reduce output in
   > > > `EXPLAIN` as well. Happy to tackle in a follow-up PR
   > >
   > > I think that will regress performance: imagine partition 1 has bounds
   > > (0, 1) and partition 2 has bounds (999998, 999999). With bounds per
   > > partition the value 1234 is filtered out. The merged bounds of
   > > (0, 999999) would include that value.
   >
   > Ah yes 🤦🏾 , definitely. Good catch!
   
   This is the fundamental limitation of a min/max bounds approach: for some
   queries and datasets it is very effective, for others not at all. That is
   why we are discussing pushing down bloom filters, etc. Keeping the min/max
   per partition is at least a good compromise for now, and it seems to be
   working well.
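
   To make the example from the thread concrete, here is a minimal Rust sketch
   of the two filtering strategies. It is not DataFusion's actual code: the
   `Bounds` type and the functions `passes_per_partition`, `merge`, and
   `passes_merged` are hypothetical names invented for illustration, and it
   assumes the per-partition ranges are combined so that a probe value is kept
   if it falls inside any one partition's range.

```rust
/// Inclusive min/max bounds observed on one build-side partition.
/// Hypothetical type for illustration; not DataFusion's actual struct.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bounds {
    min: i64,
    max: i64,
}

impl Bounds {
    /// True if `v` lies within the inclusive range [min, max].
    fn contains(&self, v: i64) -> bool {
        self.min <= v && v <= self.max
    }
}

/// Per-partition strategy: keep a probe value only if it falls inside
/// at least one partition's own range.
fn passes_per_partition(bounds: &[Bounds], v: i64) -> bool {
    bounds.iter().any(|b| b.contains(v))
}

/// Merged strategy, step 1: collapse all partitions into one global range.
/// Returns `None` when no partition has reported bounds.
fn merge(bounds: &[Bounds]) -> Option<Bounds> {
    let min = bounds.iter().map(|b| b.min).min()?;
    let max = bounds.iter().map(|b| b.max).max()?;
    Some(Bounds { min, max })
}

/// Merged strategy, step 2: keep a probe value if it falls inside the
/// single global range.
fn passes_merged(bounds: &[Bounds], v: i64) -> bool {
    merge(bounds).map_or(false, |b| b.contains(v))
}
```

   With partition bounds (0, 1) and (999998, 999999), `passes_per_partition`
   rejects the value 1234, while `passes_merged` accepts it because the merged
   range is (0, 999999). The merged filter is never more selective than the
   per-partition one; it only trades selectivity for a smaller expression.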


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
