adriangb commented on issue #14993:
URL: https://github.com/apache/datafusion/issues/14993#issuecomment-2702156471

   > Being able to evaluate the EXTRACT (minute from "EventDate") expression 
during the scan would be super helpful
   
   Maybe this is what you meant but my mind it's possible to do even better: 
instead of evaluating it during the scan, the file might even contain a 
pre-evaluated version of it. You can imagine something like a column called 
`_computed_<hash of expr's SQL>` so that you can read directly from (and use 
statistics of) a column called `_computed_8has07fas98a` (made up hash) instead 
of reading `EventDate` and computing `EXTRACT (minute from "EventDate")` on it. 
This becomes even more useful if you can use the statistics of the pre-computed 
expressions with a filter (e.g. `where extract(minute from ts) = 1`.
   
   One note about this use case though: you can *mostly* achieve it today with 
some rewrite rules (you rewrite `select EXTRACT (minute from "EventDate") from 
hits` to `select _computed_8has07fas98a as "EXTRACT (minute from \"EventDate
   ")" from hits`. It's error prone and annoying though:
   - You have to explicitly list all of these columns as part of your 
`TableProvider::schema` which can be super annoying if you have 100s of them.
   - Users can do `select _computed_8has07fas98a from hits` and they show up in 
`show columns from hits`. My goal with #14362 was to solve this.
   
   On top of these issues for a Variant type there is the issue that **these 
columns can't vary file by file**.
   While this is crucial for a Variant type I think it can still be very 
helpful for other scenarios: if you are adding pre-computed columns as an 
optimization it might make sense to only compute them during an optimization / 
compaction pass. So you end up with some files that have the pre-computed 
column and some that don't. Blindly rewriting the expression to `select 
_computed_8has07fas98a` would by default result in incorrect results because 
`SchemaAdapter` would fill in nulls instead of computing the expression in real 
time. We worked around this by essentially forking `SchemaAdapter` and giving 
it the ability to generate columns from other columns instead of always filling 
them in with nulls, but that's very error prone and wonky.
   
   > One possibility here might be add an API to TableProvider similar to 
[TableProvider::supports_filters_pushdown](https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html#method.supports_filters_pushdown)
   > maybe it would be time to make TableProvider::scan_with_args
   
   Something like that sounds great to me!
   
   A couple things to think about taking the example of `select 
variant_get(json_col, 'key')::int from data`:
   
   Who is responsible for evaluating these expressions if they can vary on a 
per-file basis?
   If the `TableProvider` says "yes, I can evaluate that expression" it is then 
responsible for doing the compute to evaluate it for every single file. If that 
data comes from a shredded column that makes sense, it's cheap. But if it has 
to start deserializing the Variant value column it's going to get expensive. 
Maybe that's not an issue but I did want to point out that it blurs the lines 
of where IO happens and where compute happens. If this is a problem I think it 
would complicate the API substantially.
   
   What expression does the `TableProvider` get passed? In this example if 
could be `variant_get(json_col, 'key')` or `variant_get(json_col, 'key')::int`. 
I'm guessing it's the former, and the same rules as 
[TableProvider::supports_filters_pushdown](https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html#method.supports_filters_pushdown)
 apply.
   
   That said I think the best path forward is likely to prototype something in 
a PR and go from there 😄 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to