adriangb commented on PR #17452: URL: https://github.com/apache/datafusion/pull/17452#issuecomment-3271843831
> > > @adriangb I think there is opportunity to simplify the bounds collection for each partition. That is, we can probably just track the min/max across all partitions and build a single `AND` binary expr once we have the final min/max (i.e. all partition bounds have been reported).
> > > Aside from one less mutex, I think it'll help reduce output in `EXPLAIN` as well. Happy to tackle in a follow-up PR.
> >
> > I think that will regress performance: imagine partition 1 has bounds (0, 1) and partition 2 has bounds (999998, 999999). With bounds per partition the value 1234 is filtered out. The merged bounds of (0, 999999) would include that value.
>
> Ah yes 🤦🏾, definitely. Good catch!

This is the fundamental limitation of a min/max bounds approach. For some queries / datasets it's going to be very effective, for others not at all. Hence why we are discussing pushing down bloom filters, etc. But keeping the min/max per partition is at least a good compromise for now / seems to be working well.
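
To make the trade-off concrete, here is a minimal sketch in plain Rust. It is not DataFusion's actual implementation: `Bounds`, `passes_per_partition`, `merge`, and `passes_merged` are hypothetical names, and it assumes the per-partition ranges are combined so that a probe value passes if any one partition's range contains it.

```rust
/// Inclusive min/max bounds reported by one build-side partition.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bounds {
    min: i64,
    max: i64,
}

impl Bounds {
    /// True if `v` lies inside this partition's [min, max] range.
    fn contains(&self, v: i64) -> bool {
        self.min <= v && v <= self.max
    }
}

/// Per-partition filter: keep `v` only if at least one partition's
/// own range contains it (an OR over per-partition range checks).
fn passes_per_partition(bounds: &[Bounds], v: i64) -> bool {
    bounds.iter().any(|b| b.contains(v))
}

/// Collapse all partition bounds into one global [min, max] range.
/// Returns `None` when no partition has reported bounds.
fn merge(bounds: &[Bounds]) -> Option<Bounds> {
    let min = bounds.iter().map(|b| b.min).min()?;
    let max = bounds.iter().map(|b| b.max).max()?;
    Some(Bounds { min, max })
}

/// Merged filter: keep `v` if the single global range contains it.
fn passes_merged(bounds: &[Bounds], v: i64) -> bool {
    merge(bounds).map_or(false, |b| b.contains(v))
}
```

With the two partitions from the example above, (0, 1) and (999998, 999999), `passes_per_partition` rejects 1234 while `passes_merged` accepts it, because the merged range (0, 999999) also covers the gap between the two partitions.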