alamb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2761489660
Thank you very much @adriangb -- given the new (warranted) complexity this feature is likely to add to DataFusion, and the fact that, if done right, it can serve as the foundation for many advanced runtime filters, I would like to help make sure we get it as right as possible.

Specifically, I would like to spend some time writing down the design for topk and what the pros / cons are (you have basically done it above, but I think it would help to consolidate it into its own document -- maybe I just need to devote some more time to reading / studying this).

I believe the approach of this PR could be summarized as

1. "create a snapshot of the current topK when each file is opened"

This will work great for pruning files and when a query is reading from multiple files. I think it will not work as well when a query is reading from one or a few large files, where it would be advantageous to update the bounds in the filter over the course of the query as it becomes more and more selective.

I will try to find time to help with this over the weekend or next week.
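To make the tradeoff concrete, here is a toy sketch of the two strategies for `ORDER BY v ASC LIMIT k`. It is not the code in this PR and uses no DataFusion APIs: `SharedBound`, `TopK`, `passes`, and `scan` are hypothetical names invented for illustration, and a "file" is modeled as a list of in-memory batches of `i64` values.

```rust
use std::collections::BinaryHeap;
use std::sync::{Arc, RwLock};

/// Hypothetical shared bound: once k rows are held, any row with
/// `v >= bound` cannot enter the top k and can be skipped by the scan.
type SharedBound = Arc<RwLock<Option<i64>>>;

/// Toy TopK operator: keeps the k smallest values seen so far and
/// publishes the largest of them as the current bound.
struct TopK {
    k: usize,
    heap: BinaryHeap<i64>, // max-heap: the root is the largest kept value
    bound: SharedBound,
}

impl TopK {
    fn new(k: usize, bound: SharedBound) -> Self {
        TopK { k, heap: BinaryHeap::new(), bound }
    }

    fn insert(&mut self, v: i64) {
        if self.heap.len() < self.k {
            self.heap.push(v);
        } else if let Some(&max) = self.heap.peek() {
            if v < max {
                self.heap.pop();
                self.heap.push(v);
            }
        }
        // Only publish a bound once the heap is full.
        if self.heap.len() == self.k {
            *self.bound.write().unwrap() = self.heap.peek().copied();
        }
    }
}

/// True if the row may still belong to the top k under `bound`.
fn passes(v: i64, bound: Option<i64>) -> bool {
    bound.map_or(true, |b| v < b)
}

/// Scans `files` (each a list of batches) and returns how many rows
/// passed the filter, plus the final top-k values in ascending order.
fn scan(files: &[Vec<Vec<i64>>], k: usize, refresh_per_batch: bool) -> (usize, Vec<i64>) {
    let bound: SharedBound = Arc::new(RwLock::new(None));
    let mut topk = TopK::new(k, Arc::clone(&bound));
    let mut rows_passed = 0;
    for file in files {
        // Strategy 1: snapshot the bound once, when the file is opened.
        let snapshot = *bound.read().unwrap();
        for batch in file {
            // Strategy 2: re-read the bound before every batch.
            let current = if refresh_per_batch {
                *bound.read().unwrap()
            } else {
                snapshot
            };
            for &v in batch {
                if passes(v, current) {
                    rows_passed += 1;
                    topk.insert(v);
                }
            }
        }
    }
    (rows_passed, topk.heap.into_sorted_vec())
}
```

With one large file made of the batches `[5, 1, 9]`, `[8, 7, 6]`, `[2, 10, 11]` and `k = 2`, the snapshot strategy lets all 9 rows through because the bound is still empty when the file is opened, while the per-batch refresh lets only 4 through. Splitting the same batches across three files makes the snapshot strategy equally selective, which matches the observation above that it works well when reading many files.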