Re: [I] Support Push down expression evaluation in `TableProviders` [datafusion]

via GitHub Wed, 05 Mar 2025 13:56:50 -0800


adriangb commented on issue #14993:
URL: https://github.com/apache/datafusion/issues/14993#issuecomment-2702156471

> Being able to evaluate the EXTRACT (minute from "EventDate") expression
during the scan would be super helpful

Maybe this is what you meant but my mind it's possible to do even better:
instead of evaluating it during the scan, the file might even contain a
pre-evaluated version of it. You can imagine something like a column called
`_computed_<hash of expr's SQL>` so that you can read directly from (and use
statistics of) a column called `_computed_8has07fas98a` (made up hash) instead
of reading `EventDate` and computing `EXTRACT (minute from "EventDate")` on it.
This becomes even more useful if you can use the statistics of the pre-computed
expressions with a filter (e.g. `where extract(minute from ts) = 1`.

One note about this use case though: you can *mostly* achieve it today with
some rewrite rules (you rewrite `select EXTRACT (minute from "EventDate") from
hits` to `select _computed_8has07fas98a as "EXTRACT (minute from \"EventDate
")" from hits`. It's error prone and annoying though:
- You have to explicitly list all of these columns as part of your
`TableProvider::schema` which can be super annoying if you have 100s of them.
- Users can do `select _computed_8has07fas98a from hits` and they show up in
`show columns from hits`. My goal with #14362 was to solve this.

On top of these issues for a Variant type there is the issue that **these
columns can't vary file by file**.
While this is crucial for a Variant type I think it can still be very
helpful for other scenarios: if you are adding pre-computed columns as an
optimization it might make sense to only compute them during an optimization /
compaction pass. So you end up with some files that have the pre-computed
column and some that don't. Blindly rewriting the expression to `select
_computed_8has07fas98a` would by default result in incorrect results because
`SchemaAdapter` would fill in nulls instead of computing the expression in real
time. We worked around this by essentially forking `SchemaAdapter` and giving
it the ability to generate columns from other columns instead of always filling
them in with nulls, but that's very error prone and wonky.

> One possibility here might be add an API to TableProvider similar to
[TableProvider::supports_filters_pushdown](https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html#method.supports_filters_pushdown)
> maybe it would be time to make TableProvider::scan_with_args

Something like that sounds great to me!

A couple things to think about taking the example of `select
variant_get(json_col, 'key')::int from data`:

Who is responsible for evaluating these expressions if they can vary on a
per-file basis?
If the `TableProvider` says "yes, I can evaluate that expression" it is then
responsible for doing the compute to evaluate it for every single file. If that
data comes from a shredded column that makes sense, it's cheap. But if it has
to start deserializing the Variant value column it's going to get expensive.
Maybe that's not an issue but I did want to point out that it blurs the lines
of where IO happens and where compute happens. If this is a problem I think it
would complicate the API substantially.

What expression does the `TableProvider` get passed? In this example if
could be `variant_get(json_col, 'key')` or `variant_get(json_col, 'key')::int`.
I'm guessing it's the former, and the same rules as
[TableProvider::supports_filters_pushdown](https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html#method.supports_filters_pushdown)
apply.

That said I think the best path forward is likely to prototype something in
a PR and go from there 😄

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Support Push down expression evaluation in `TableProviders` [datafusion]

Reply via email to