alamb commented on PR #15301:
URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2761489660

   Thank you very much @adriangb  -- given the new (warranted) complexity this 
feature is likely to add to DataFusion, and the fact that, if done right, it can 
serve as the foundation for many advanced runtime filters, I would like to help 
make sure we get it as right as possible. 
   
   
   Specifically, I would like to spend some time writing down the design for 
TopK and what the pros / cons are (you have basically done it above, but I 
think it would help to consolidate it into its own document -- maybe I just 
need to devote some more time to reading / studying this). 
   
   I believe the approach of this PR could be summarized as 
   1. "create a snapshot of the current topK when each file is opened"
   
   This will work well for pruning files and when a query is reading from 
multiple files. 
   
   I think it will not work as well when a query is reading from one or a few 
large files, where it would be advantageous to update the bounds in the filter 
over the course of the query as it becomes more and more selective.
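
   To make the difference concrete, here is a minimal sketch of the two 
strategies for a query shaped like `ORDER BY v ASC LIMIT k`. It is not the code 
in this PR: `TopK`, `threshold`, and `scan` are hypothetical names, a "file" is 
modeled as a list of in-memory batches of `i64` values, and the filter is 
applied row by row rather than through file or row-group statistics.

```rust
use std::collections::BinaryHeap;

/// Hypothetical tracker for the k smallest values seen so far.
struct TopK {
    k: usize,
    heap: BinaryHeap<i64>, // max-heap: the top is the worst value currently kept
}

impl TopK {
    fn new(k: usize) -> Self {
        TopK { k, heap: BinaryHeap::new() }
    }

    /// Once k values are held, only rows with `v < bound` can still enter the top K.
    fn threshold(&self) -> Option<i64> {
        if self.heap.len() == self.k {
            self.heap.peek().copied()
        } else {
            None
        }
    }

    fn insert(&mut self, v: i64) {
        if self.heap.len() < self.k {
            self.heap.push(v);
        } else if let Some(&worst) = self.heap.peek() {
            if v < worst {
                self.heap.pop();
                self.heap.push(v);
            }
        }
    }
}

/// Scans `files` (each file is a list of batches) and returns how many rows
/// passed the filter, plus the final top-K values in ascending order.
///
/// `refresh_per_batch = false`: the bound is a snapshot taken when a file is opened.
/// `refresh_per_batch = true`: the bound is re-read before every batch.
fn scan(files: &[Vec<Vec<i64>>], k: usize, refresh_per_batch: bool) -> (usize, Vec<i64>) {
    let mut topk = TopK::new(k);
    let mut rows_passed = 0;
    for file in files {
        let mut bound = topk.threshold(); // snapshot at file open
        for batch in file {
            if refresh_per_batch {
                bound = topk.threshold();
            }
            for &v in batch {
                if bound.map_or(true, |b| v < b) {
                    rows_passed += 1;
                    topk.insert(v);
                }
            }
        }
    }
    (rows_passed, topk.heap.into_sorted_vec())
}
```

   In this toy model, a single file opened while the heap is still empty never 
gets a bound under the snapshot strategy, so every row reaches TopK. Refreshing 
the bound per batch filters most rows after the first batch. When the same data 
is split into several small files, both strategies filter about equally well.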
   
   I will try to find time to help with this over the weekend or next week.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
