adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2736816854
> I think this is just part of the picture. To fully match DuckDB we'd have to do something like the rewrite proposed in [#15177 (comment)](https://github.com/apache/datafusion/issues/15177#issuecomment-2718074072) aka "late materialization" of the projection.

To expand on this: what I implemented here is just "dumb" filter pushdown. To make a query like `SELECT * FROM data ORDER BY id DESC LIMIT 10` fast you need either the late materialization proposed in that comment, or ordering and throttling of file reads (similar to [SortPreservingMerge](https://github.com/apache/datafusion/issues/15191)):

1. You need to order files within each partition so that you read the ones "more likely" to produce meaningful filters first. So if you have files with id ranges `(1,5)` and `(3,8)` you should read the `(3,8)` file first. I guess `TableProvider`s and such need to handle this.
2. You may want to consider reducing the number of partitions, since the fan-out may be wasted work: if you do (1) correctly and 1-2 files are enough to fill the TopK, then a fan-out to 32 partitions means you opened ~30 files for no reason, and the whole query would likely have been faster if you had focused all effort on the 1-2 files you actually needed.
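To make point (1) concrete, here is a minimal standalone sketch of the idea, not DataFusion code. `FileMeta`, `order_for_desc_topk`, and `topk_desc` are hypothetical names invented for illustration, and I'm assuming each file exposes a max statistic for `id`. Files are read one at a time, and a file is skipped whenever its max cannot beat the current TopK threshold:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical per-file metadata: min/max statistics for `id`, plus the
/// rows themselves so the sketch stays self-contained.
#[allow(dead_code)] // `min_id` is kept only to mirror the (min, max) ranges above
struct FileMeta {
    name: &'static str,
    min_id: i64,
    max_id: i64,
    ids: Vec<i64>,
}

impl FileMeta {
    /// Builds a file holding every id in `min_id..=max_id` (an assumption
    /// made purely to keep the example small).
    fn from_range(name: &'static str, min_id: i64, max_id: i64) -> Self {
        FileMeta { name, min_id, max_id, ids: (min_id..=max_id).collect() }
    }
}

/// For `ORDER BY id DESC`, put the files with the largest `max_id` first.
fn order_for_desc_topk(files: &mut [FileMeta]) {
    files.sort_by_key(|f| Reverse(f.max_id));
}

/// Scans files in the given order, keeping the `k` largest ids.
/// Returns the ids in descending order and the names of the files opened.
fn topk_desc(files: &[FileMeta], k: usize) -> (Vec<i64>, Vec<&'static str>) {
    let mut opened = Vec::new();
    if k == 0 {
        return (Vec::new(), opened);
    }
    // Min-heap of the current top-k; its smallest element is the threshold.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::new();
    for f in files {
        if heap.len() == k {
            if let Some(Reverse(threshold)) = heap.peek() {
                // Nothing in this file can displace a value already in the heap.
                if f.max_id <= *threshold {
                    continue;
                }
            }
        }
        opened.push(f.name);
        for &id in &f.ids {
            heap.push(Reverse(id));
            if heap.len() > k {
                heap.pop(); // drop the smallest so only k values remain
            }
        }
    }
    let mut top: Vec<i64> = heap.into_iter().map(|Reverse(v)| v).collect();
    top.sort_by(|a, b| b.cmp(a));
    (top, opened)
}
```

With files `a = (1,5)`, `b = (3,8)`, `c = (0,2)` and `k = 3`, scanning in the order `a, b, c` opens both `a` and `b`, while calling `order_for_desc_topk` first opens only `b`. Both runs return the same top three ids. This is also the intuition behind point (2): once the order is right, extra partitions mostly open files that end up being skipped.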