alamb commented on issue #15037:
URL: https://github.com/apache/datafusion/issues/15037#issuecomment-2762326990

   @adriangb and I had a discussion about 
https://github.com/apache/datafusion/pull/15301.
   
   Here are some notes:
   ## Use cases
   - TopK dynamic filter pushdown
     - Prune files with the dynamic filter based on statistics
     - Prune row groups with the dynamic filter based on statistics
     - Prune row pages with the dynamic filter based on statistics
     - Apply the dynamic filter during filtering when pushdown is enabled
   - Join SIPs
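
To make the statistics-based pruning cases concrete, here is a minimal sketch of the check that files, row groups, and pages would all share. It assumes a single-column `ORDER BY x LIMIT k` over `i64` values; `MinMax`, `can_prune`, and `surviving` are hypothetical names for illustration, not DataFusion APIs.

```rust
/// Min/max statistics for one container (a file, row group, or page).
/// Hypothetical type for illustration; not DataFusion's statistics API.
#[derive(Debug, Clone, Copy)]
struct MinMax {
    min: i64,
    max: i64,
}

/// Decide whether a container can be skipped given the current TopK threshold.
///
/// Ascending (`ORDER BY x ASC LIMIT k`): once the heap is full, the dynamic
/// filter is `x < threshold`, where `threshold` is the largest value kept in
/// the heap. No row can pass if the container's minimum is already >= threshold.
///
/// Descending is the mirror image: the filter is `x > threshold` and the
/// container is skipped when its maximum is <= threshold.
///
/// `None` means the heap is not full yet, so nothing can be pruned.
fn can_prune(stats: MinMax, threshold: Option<i64>, descending: bool) -> bool {
    match threshold {
        None => false,
        Some(t) if descending => stats.max <= t,
        Some(t) => stats.min >= t,
    }
}

/// Return the indexes of the containers that still have to be read.
fn surviving(containers: &[MinMax], threshold: Option<i64>, descending: bool) -> Vec<usize> {
    containers
        .iter()
        .enumerate()
        .filter(|(_, stats)| !can_prune(**stats, threshold, descending))
        .map(|(index, _)| index)
        .collect()
}
```

The same predicate applies at each granularity; only the source of the min/max statistics differs.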
   
   ## Pros / Cons
   The pros for merging this PR are:
   - We already have benchmarks that show some performance improvement
   The cons:
   - It requires a special implementation in any operator (like FileOpener) to 
take advantage of such filters. This is not a blocker in my mind – but I do 
think implementing a PhysicalExpr is a cleaner design. As Adrian says, we can 
refactor it over time if/when PhysicalExpr gets more sophisticated
   - We will get even more performance when filter_pushdown is enabled (again, 
maybe this is just follow-on work)
   
   ## Nice to haves
   - A plan with multiple partitions has multiple heaps (e.g. for 16 input 
partitions, we end up with 17 TopK heaps – one for each partition and then a 
global one), but this PR can only apply the per-partition top k value.
     - It would be nice to somehow be able to use all the top values (i.e. pick 
the smallest one) when filtering.
   - This PR takes a snapshot of the contents of the TopK heap when a file is 
opened and never changes it. 
     - This is good for pruning as all the pruning (file, row group and page) 
happens on file opening
     - It is not as good for filter_pushdown, where the values in the TopK heap 
can change over the course of the query, so using the snapshot means the dynamic 
filter doesn't improve over time
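
As a rough illustration of both nice-to-haves, here is a sketch of a threshold shared by all partitions and read live rather than snapshotted. It assumes `ORDER BY x ASC LIMIT k` over `i64` values; `SharedThreshold` and `publish_all` are hypothetical names, and this is not how the PR is implemented.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical threshold shared by every partition's TopK heap.
/// `i64::MAX` is used as the "no threshold yet" sentinel.
#[derive(Debug)]
struct SharedThreshold {
    value: AtomicI64,
}

impl SharedThreshold {
    fn new() -> Self {
        Self { value: AtomicI64::new(i64::MAX) }
    }

    /// Called by a partition once its heap is full, with the largest value it
    /// currently retains. `fetch_min` keeps the smallest value seen across all
    /// partitions, which is the tightest bound that is still safe.
    fn tighten(&self, partition_threshold: i64) {
        self.value.fetch_min(partition_threshold, Ordering::Relaxed);
    }

    /// Read the current threshold. Callers that re-read this per batch see
    /// improvements over time; callers that store the result have a snapshot.
    fn current(&self) -> i64 {
        self.value.load(Ordering::Relaxed)
    }

    /// Evaluate the dynamic filter `x < threshold` against the live value.
    /// While the sentinel is in place every row passes, which is conservative.
    fn passes(&self, x: i64) -> bool {
        let threshold = self.current();
        threshold == i64::MAX || x < threshold
    }
}

/// Simulate several partitions publishing their thresholds concurrently.
fn publish_all(shared: &Arc<SharedThreshold>, per_partition: Vec<i64>) {
    let handles: Vec<_> = per_partition
        .into_iter()
        .map(|threshold| {
            let shared = Arc::clone(shared);
            thread::spawn(move || shared.tighten(threshold))
        })
        .collect();
    for handle in handles {
        handle.join().expect("partition thread panicked");
    }
}
```

A value read once at file open (the snapshot) stays at whatever it was at that moment, while `passes` picks up every later call to `tighten`.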
   
   I believe Adrian is going to look into these – but I also think they could 
easily be done as a follow-on PR.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

