adriangb commented on PR #15301:
URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2736816854

   > I think this is just part of the picture. To fully match DuckDB we'd have to do something like the rewrite proposed in [#15177 (comment)](https://github.com/apache/datafusion/issues/15177#issuecomment-2718074072), aka "late materialization" of the projection.
   
   To expand on this: what I implemented here is just "dumb" filter pushdown. 
To make a query like `SELECT * FROM data ORDER BY id DESC LIMIT 10` fast you 
need the late materialization proposed in that comment, or ordering and 
throttling of file reads (similar to 
[SortPreservingMerge](https://github.com/apache/datafusion/issues/15191)):
   1. You need to order files within each partition so that you read the ones "more 
likely" to produce meaningful filters first. So if you have files with id 
ranges `(1,5)` and `(3,8)` you should read the `(3,8)` file first, since it 
holds the largest ids. I guess `TableProvider`s and the like need to handle this.
   2. You may want to consider reducing the number of partitions since the fan 
out may be wasted work: if you do (1) correctly and 1-2 files are enough to 
fill the TopK then a fan out to 32 partitions means you opened ~30 files for no 
reason and the whole query would have likely been faster if you focused all 
effort on those 1-2 files you actually needed.
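
   A minimal sketch of both points, assuming per-file min/max statistics for `id` are available. The `FileStats` type and all three functions are hypothetical and are not DataFusion APIs. The sketch only uses statistics, whereas a real TopK filter would tighten its threshold from the actual rows read, so this is a conservative estimate:

```rust
/// Hypothetical per-file statistics for the `id` column.
/// Illustrative only; this is not a DataFusion type.
#[derive(Debug, Clone, PartialEq)]
pub struct FileStats {
    pub path: String,
    pub min_id: i64,
    pub max_id: i64,
    pub num_rows: usize,
}

/// Order files for `ORDER BY id DESC LIMIT k`: the file with the largest
/// `max_id` comes first, ties broken by the larger `min_id`.
pub fn order_for_desc_topk(mut files: Vec<FileStats>) -> Vec<FileStats> {
    files.sort_by(|a, b| {
        b.max_id
            .cmp(&a.max_id)
            .then_with(|| b.min_id.cmp(&a.min_id))
    });
    files
}

/// Given files already ordered by `order_for_desc_topk`, return how many
/// leading files must be opened before statistics alone prove that the
/// remaining files cannot contribute to the top `k` rows.
pub fn files_worth_opening(ordered: &[FileStats], k: usize) -> usize {
    let mut rows_seen = 0usize;
    // Every row read so far has `id >= lower_bound`.
    let mut lower_bound = i64::MAX;
    for (i, file) in ordered.iter().enumerate() {
        // With at least k rows all >= lower_bound, the k-th largest id is
        // >= lower_bound. Files are sorted by max_id descending, so this
        // file and every later one fall strictly below it.
        if rows_seen >= k && file.max_id < lower_bound {
            return i;
        }
        rows_seen += file.num_rows;
        lower_bound = lower_bound.min(file.min_id);
    }
    ordered.len()
}

/// Cap the fan out at the number of files that are actually worth opening.
pub fn suggested_partitions(configured: usize, files_needed: usize) -> usize {
    configured.min(files_needed).max(1)
}
```

   With the `(1,5)` and `(3,8)` files above, `order_for_desc_topk` puts the `(3,8)` file first. If a third file covering `(10,20)` with at least 10 rows exists, `files_worth_opening` reports that one file is enough for `LIMIT 10`, and `suggested_partitions(32, 1)` collapses the fan out to a single partition.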


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

