adriangb commented on issue #14993: URL: https://github.com/apache/datafusion/issues/14993#issuecomment-2702156471
> Being able to evaluate the EXTRACT (minute from "EventDate") expression during the scan would be super helpful Maybe this is what you meant but my mind it's possible to do even better: instead of evaluating it during the scan, the file might even contain a pre-evaluated version of it. You can imagine something like a column called `_computed_<hash of expr's SQL>` so that you can read directly from (and use statistics of) a column called `_computed_8has07fas98a` (made up hash) instead of reading `EventDate` and computing `EXTRACT (minute from "EventDate")` on it. This becomes even more useful if you can use the statistics of the pre-computed expressions with a filter (e.g. `where extract(minute from ts) = 1`. One note about this use case though: you can *mostly* achieve it today with some rewrite rules (you rewrite `select EXTRACT (minute from "EventDate") from hits` to `select _computed_8has07fas98a as "EXTRACT (minute from \"EventDate ")" from hits`. It's error prone and annoying though: - You have to explicitly list all of these columns as part of your `TableProvider::schema` which can be super annoying if you have 100s of them. - Users can do `select _computed_8has07fas98a from hits` and they show up in `show columns from hits`. My goal with #14362 was to solve this. On top of these issues for a Variant type there is the issue that **these columns can't vary file by file**. While this is crucial for a Variant type I think it can still be very helpful for other scenarios: if you are adding pre-computed columns as an optimization it might make sense to only compute them during an optimization / compaction pass. So you end up with some files that have the pre-computed column and some that don't. Blindly rewriting the expression to `select _computed_8has07fas98a` would by default result in incorrect results because `SchemaAdapter` would fill in nulls instead of computing the expression in real time. We worked around this by essentially forking `SchemaAdapter` and giving it the ability to generate columns from other columns instead of always filling them in with nulls, but that's very error prone and wonky. > One possibility here might be add an API to TableProvider similar to [TableProvider::supports_filters_pushdown](https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html#method.supports_filters_pushdown) > maybe it would be time to make TableProvider::scan_with_args Something like that sounds great to me! A couple things to think about taking the example of `select variant_get(json_col, 'key')::int from data`: Who is responsible for evaluating these expressions if they can vary on a per-file basis? If the `TableProvider` says "yes, I can evaluate that expression" it is then responsible for doing the compute to evaluate it for every single file. If that data comes from a shredded column that makes sense, it's cheap. But if it has to start deserializing the Variant value column it's going to get expensive. Maybe that's not an issue but I did want to point out that it blurs the lines of where IO happens and where compute happens. If this is a problem I think it would complicate the API substantially. What expression does the `TableProvider` get passed? In this example if could be `variant_get(json_col, 'key')` or `variant_get(json_col, 'key')::int`. I'm guessing it's the former, and the same rules as [TableProvider::supports_filters_pushdown](https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html#method.supports_filters_pushdown) apply. That said I think the best path forward is likely to prototype something in a PR and go from there 😄 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org