comphead commented on PR #103:
URL: https://github.com/apache/datafusion-site/pull/103#issuecomment-3272391781
> 1. For the TopK operator, it comes from the TopK heap.
> 2. For joins we accumulate min/max values of the join keys for each partition as we build the build side.
>
> So the values are coming from the data itself.

Maybe this needs slightly more detail for the reader; I'm still trying to grasp the idea. 🤔

First of all, having the filter makes a lot of sense, since we don't scan what's unnecessary. However, to get the filter value (it doesn't have to be super accurate, just close enough to reduce the reading scope), it would be possible to scan `select min(ts) from t1` first: that touches a single column, which should be cheap, and even cheaper if the min/max can be derived from the file footer, and then apply that value to the TopK filter.

For the heap, though, the algorithm is still not clear to me. How does it make sure we don't need to scan 100M rows as before? Does that hold in any scenario, or only when the underlying files' data is sorted? If the heap stores the top K, doesn't it still need to read all the rows, so that the benefit is only that we skip the full sort and just maintain a heap?
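To make the question concrete, here is a rough sketch (made-up names and data, not DataFusion's actual code) of how I understand the heap threshold could combine with per-row-group min/max statistics to skip data even when the input is not sorted:

```rust
use std::collections::BinaryHeap;

/// Rough sketch of a TopK heap for `ORDER BY ts ASC LIMIT k`.
/// `BinaryHeap` is a max-heap, so the root is the largest of the k
/// smallest values seen so far, i.e. the current pruning threshold.
struct TopK {
    k: usize,
    heap: BinaryHeap<i64>,
}

impl TopK {
    fn new(k: usize) -> Self {
        Self { k, heap: BinaryHeap::new() }
    }

    /// Once the heap is full, any row with ts >= threshold can never
    /// enter the top k, so the scan can filter it out.
    fn threshold(&self) -> Option<i64> {
        (self.heap.len() == self.k).then(|| *self.heap.peek().unwrap())
    }

    fn insert(&mut self, ts: i64) {
        if self.heap.len() < self.k {
            self.heap.push(ts);
        } else if ts < *self.heap.peek().unwrap() {
            self.heap.pop();
            self.heap.push(ts);
        }
    }
}

fn main() {
    let mut topk = TopK::new(3);
    // Pretend these are (min_ts from footer stats, rows) per row group;
    // the values are made up for illustration.
    let row_groups = vec![
        (5, vec![5, 9, 7]),
        (1, vec![1, 2, 8]),
        (100, vec![100, 200, 300]),
    ];
    for (min_ts, rows) in row_groups {
        // Stats pruning: if even the group's smallest ts cannot beat the
        // current threshold, skip the whole group without reading it.
        if matches!(topk.threshold(), Some(t) if min_ts >= t) {
            println!("skipping row group with min_ts={min_ts}");
            continue;
        }
        for ts in rows {
            topk.insert(ts);
        }
    }
    println!("top-3: {:?}", topk.heap.into_sorted_vec());
}
```

If that is roughly the mechanism, then on unsorted data the savings come from statistics pruning, and in the worst case we would still read every row and only save the full sort. Is that the intended behaviour?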