comphead commented on PR #103:
URL: https://github.com/apache/datafusion-site/pull/103#issuecomment-3272391781
> 1. For the TopK operator, it comes from the TopK heap.
> 2. For joins we accumulate min/max values of the join keys for each partition as we build the build side.
>
> So the values are coming from the data itself.

Maybe this needs slightly more detail for the reader; I'm still trying to grasp the idea. 🤔

First of all, having the filter makes a lot of sense, since we don't scan what's unnecessary. However, to get the filter value (it doesn't have to be super accurate, just close enough to reduce the reading scope), it would be possible to scan `select min(ts) from t1` first: that touches a single column, which should be cheap, and even cheaper if the min/max can be derived from the file footer, and then apply that value to the TopK filter.

For the heap, though, the algorithm is still not clear to me. How does it make sure we don't need to scan 100M rows as before? Does that hold in any scenario, or only when the underlying files' data is sorted? If the heap stores the top K, doesn't it still need to read all the rows, so that the benefit is only that we skip the full sort and just maintain a heap?
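To make the question concrete, here is a rough sketch (made-up names and data, not DataFusion's actual code) of how I understand the heap threshold could combine with per-row-group min/max statistics to skip data even when the input is not sorted:

```rust
use std::collections::BinaryHeap;

/// Rough sketch of a TopK heap for `ORDER BY ts ASC LIMIT k`.
/// `BinaryHeap` is a max-heap, so the root is the largest of the k
/// smallest values seen so far, i.e. the current pruning threshold.
struct TopK {
    k: usize,
    heap: BinaryHeap<i64>,
}

impl TopK {
    fn new(k: usize) -> Self {
        Self { k, heap: BinaryHeap::new() }
    }

    /// Once the heap is full, any row with ts >= threshold can never
    /// enter the top k, so the scan can filter it out.
    fn threshold(&self) -> Option<i64> {
        (self.heap.len() == self.k).then(|| *self.heap.peek().unwrap())
    }

    fn insert(&mut self, ts: i64) {
        if self.heap.len() < self.k {
            self.heap.push(ts);
        } else if ts < *self.heap.peek().unwrap() {
            self.heap.pop();
            self.heap.push(ts);
        }
    }
}

fn main() {
    let mut topk = TopK::new(3);
    // Pretend these are (min_ts from footer stats, rows) per row group;
    // the values are made up for illustration.
    let row_groups = vec![
        (5, vec![5, 9, 7]),
        (1, vec![1, 2, 8]),
        (100, vec![100, 200, 300]),
    ];
    for (min_ts, rows) in row_groups {
        // Stats pruning: if even the group's smallest ts cannot beat the
        // current threshold, skip the whole group without reading it.
        if matches!(topk.threshold(), Some(t) if min_ts >= t) {
            println!("skipping row group with min_ts={min_ts}");
            continue;
        }
        for ts in rows {
            topk.insert(ts);
        }
    }
    println!("top-3: {:?}", topk.heap.into_sorted_vec());
}
```

If that is roughly the mechanism, then on unsorted data the savings come from statistics pruning, and in the worst case we would still read every row and only save the full sort. Is that the intended behaviour?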