alamb commented on issue #15513: URL: https://github.com/apache/datafusion/issues/15513#issuecomment-3027160719
Here is my suggestion for a blog / outline:

The goal is a technical evangelism piece. The reader should come away having learned something about columnar query engines (not just that DataFusion is great, which it is!)

# Title: Using Dynamic Filters to make TopK / LIMIT queries much faster

# Structure:

## Intro w/ some sort of summary performance chart

Running example: a simple example query. I think the ClickBench Q23 `SELECT * FROM hits ORDER BY time DESC LIMIT 10` is a pretty good one, as it is so simple but illustrates the point.

More details can be summarized from https://github.com/apache/datafusion/issues/15177

## Background

Show the plan for Q23

Explain the existing topk optimization (that there is a heap)

Explain that the query does much more work than necessary because it decodes all rows just to throw all but 10 of them away

Introduce the notion of filter pushdown and point out that DataFusion does it at multiple phases
- Listing table (prune files)
- During opening (prune files again)
- During row group / data page filtering
- During the scan (if `pushdown_filters` is on)

## Dynamic Filters

Explain that, once the plan has started running, the topk operator knows the minimum time that could still be emitted. This is basically a filter like `WHERE time > (current min in top k)`. However, the current min isn't known at plan time.

Then describe the summary technical approach, highlighting that you made it general purpose to also support SIPs (sideways information passing) and other user defined dynamic filters. Also highlight that you worked with the community to do this
* Add an API for pushing down filters and introducing dynamic filters to the ExecutionPlan trait
* Add appropriate APIs for updating those filters at runtime and adding new points to prune (e.g. on file open)

## Results

Show some sort of results if possible

## Conclusion / Call to action

This will be released in DataFusion 49

Come help / join us / use DataFusion 🎣
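To make the dynamic filter idea above concrete for readers, the post could include a toy model along these lines. This is only a sketch in Python, not DataFusion code: `DynamicFilter`, `TopK`, and `run_query` are hypothetical names, each batch is a plain list of `time` values, and `max(batch)` stands in for the min/max statistics a real scan would read from file metadata without decoding the rows.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DynamicFilter:
    """Hypothetical shared predicate `time > threshold`; None means no filter yet."""
    threshold: Optional[int] = None

    def may_contain_matches(self, batch_max: int) -> bool:
        # A batch can only matter if its largest value beats the threshold.
        return self.threshold is None or batch_max > self.threshold


class TopK:
    """Keeps the k largest values seen so far in a min-heap."""

    def __init__(self, k: int, dynamic_filter: DynamicFilter) -> None:
        self.k = k
        self.heap: List[int] = []
        self.filter = dynamic_filter

    def insert_batch(self, batch: List[int]) -> None:
        for value in batch:
            if len(self.heap) < self.k:
                heapq.heappush(self.heap, value)
            elif value > self.heap[0]:
                # Evict the current minimum of the top k.
                heapq.heapreplace(self.heap, value)
        if len(self.heap) == self.k:
            # Publish the new bound: `time > (current min in top k)`.
            self.filter.threshold = self.heap[0]

    def results(self) -> List[int]:
        return sorted(self.heap, reverse=True)


def run_query(batches: List[List[int]], k: int) -> Tuple[List[int], int]:
    """Model `ORDER BY time DESC LIMIT k`; returns (top k, batches decoded)."""
    dyn = DynamicFilter()
    topk = TopK(k, dyn)
    decoded = 0
    for batch in batches:
        # Assumption: max(batch) plays the role of precomputed statistics.
        if not dyn.may_contain_matches(max(batch)):
            continue  # pruned without decoding
        decoded += 1
        topk.insert_batch(batch)
    return topk.results(), decoded


if __name__ == "__main__":
    top, decoded = run_query([[5, 9, 1], [3, 2, 4], [10, 8, 7], [6, 0, 1]], k=2)
    print(top, decoded)
```

With these four batches and `k=2`, the second and fourth batches are skipped because their maximum cannot beat the threshold published by the heap. The comparison is strict, so rows that tie with the current minimum are dropped, which is acceptable here because any `k` rows with the largest values are a valid answer.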