alamb commented on issue #15037: URL: https://github.com/apache/datafusion/issues/15037#issuecomment-2762326990
@adriangb and I had a discussion about https://github.com/apache/datafusion/pull/15301. Here are some notes:

## Use cases

- TopK dynamic filter pushdown
- Prune files with dynamic filter based on statistics
- Prune row groups with dynamic filter based on statistics
- Prune pages with dynamic filter based on statistics
- Apply during filtering when pushdown enabled
- Join SIPs

## Pros / Cons

The pros for merging this PR are:

- We already have benchmarks that show some performance improvement

The cons:

- It requires special implementation for any operators (like FileOpener) to take advantage of such filters. This is not a blocker in my mind, but I do think implementing a PhysicalExpr is a cleaner design. As Adrian says, we can refactor it over time if/when PhysicalExpr gets more sophisticated
- We will get even more performance when filter_pushdown is enabled (again maybe this is just follow on work)

## Nice to haves

- A plan with multiple partitions has multiple TopK heaps (e.g. for 16 input partitions, we end up with 17 heaps: one for each partition and then a global one), but this PR can only apply the per-partition top k value.
  - It would be nice to somehow be able to use all the top values (aka pick the smallest one) when filtering.
- This PR takes a snapshot of the contents of the TopK heap when a file is opened and never changes it.
  - This is good for pruning, as all the pruning (file, row group and page) happens on file opening
  - It is not as good for filter_pushdown, where the values in the TopK heap can change over the course of the query, so using the snapshot means the dynamic filter doesn't improve over time

I believe Adrian is going to look into these, but I also think they could easily be done as a follow on PR.
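To make the two nice-to-haves concrete, here is a minimal sketch of the idea. It is not the code in the PR. It assumes a query shaped like `ORDER BY v ASC LIMIT k` over an `i64` column, and every name in it (`SharedThreshold`, `TopK`, `file_may_match`, `demo`) is hypothetical and does not correspond to a DataFusion API. Each partition keeps its own heap but publishes its current k-th best value to a shared threshold, so a reader can use the smallest value seen by any partition, and can read it either once (a snapshot) or again later (live).

```rust
use std::collections::BinaryHeap;
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Hypothetical shared threshold: the smallest "k-th best" value that any
/// partition has published so far. Starts at i64::MAX (filters nothing).
pub struct SharedThreshold(AtomicI64);

impl SharedThreshold {
    pub fn new() -> Self {
        Self(AtomicI64::new(i64::MAX))
    }
    /// Lower the threshold to `candidate` if `candidate` is smaller.
    pub fn tighten(&self, candidate: i64) {
        self.0.fetch_min(candidate, Ordering::Relaxed);
    }
    /// Read the threshold as of now.
    pub fn current(&self) -> i64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Hypothetical per-partition TopK: a max-heap holding the k smallest values.
pub struct TopK {
    k: usize,
    heap: BinaryHeap<i64>,
    shared: Arc<SharedThreshold>,
}

impl TopK {
    pub fn new(k: usize, shared: Arc<SharedThreshold>) -> Self {
        Self { k, heap: BinaryHeap::new(), shared }
    }

    pub fn insert(&mut self, v: i64) {
        if self.heap.len() < self.k {
            self.heap.push(v);
        } else if self.heap.peek().map_or(false, |&worst| v < worst) {
            // v beats the current worst of the k kept values: swap it in.
            self.heap.pop();
            self.heap.push(v);
        }
        // Once this partition holds k values, nothing larger than its worst
        // kept value can be in the global top k, so publish that value.
        if self.k > 0 && self.heap.len() == self.k {
            if let Some(&worst) = self.heap.peek() {
                self.shared.tighten(worst);
            }
        }
    }
}

/// Statistics-based pruning: a file whose column minimum is above the
/// threshold cannot contribute any row to the top k.
pub fn file_may_match(file_min: i64, threshold: i64) -> bool {
    file_min <= threshold
}

/// Compares a threshold captured early (snapshot) with one read later (live)
/// for a file whose minimum value is 20.
pub fn demo() -> (bool, bool) {
    let shared = Arc::new(SharedThreshold::new());
    let mut p0 = TopK::new(2, Arc::clone(&shared));
    let mut p1 = TopK::new(2, Arc::clone(&shared));

    p0.insert(50);
    p0.insert(40);
    let snapshot = shared.current(); // captured when the "file is opened"

    p1.insert(7);
    p1.insert(3); // another partition tightens the threshold afterwards

    let file_min = 20;
    (
        file_may_match(file_min, snapshot),
        file_may_match(file_min, shared.current()),
    )
}
```

In this sketch `demo()` returns `(true, false)`: the snapshot threshold still lets the file through, while the live threshold, tightened by the other partition, prunes it. A real implementation would also need to handle descending sorts, multi-column sort keys, nulls and non-integer types, none of which are covered here.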