Re: [I] [EPIC] Decouple logical from physical types [datafusion]

via GitHub Thu, 06 Feb 2025 00:00:47 -0800


tobixdev commented on issue #12622:
URL: https://github.com/apache/datafusion/issues/12622#issuecomment-2639093257

> I think getting `Scalar` from `DataType` or `ArrayRef` makes sense, so the
restriction of this doesn't seem sound to me.

What I meant is that we should not be able to directly create a
`ScalarValue` from these things as it is easy to accidentally drop the physical
type. For example, `ScalarValue::from(DataType::Utf8View)` should not be
possible but `Scalar::from(DataType::Utf8View)` should be possible. As you
said, one can call `into_value()` if you don't care about the physical type.

> Is it possible to carry Scalar along where we need to know the DataType?
If not, what might be the case? I guess we need get_value() instead of
into_value()?

I think we need to change many parts of DataFusion to use a `Scalar` instead
of a `ScalarValue` if we go with this approach. For example, in `Accumulator`
(`expr-common/src/accumulator.rs`):
- `Accumulator::evaluate` will need to return a `Scalar` instead of a
`ScalarValue`. Otherwise aggregates cannot evaluate to physical objects.
- This begs the question wheter `Accumulator::state` should return
`Result<Vec<Scalar>>` instead of `Result<Vec<ScalarValue>>` (I think so)

I think there are many parts like this in DF and maybe we can only start
removing `ScalarValue` variants once these parts have been migrated to use
`Scalar`. Otherwise, we might unintentionally loose type information.

> Getting Array from ScalarValue is valid since physical type is tightly
coupled by ScalarValue. When we switch to Scalar, ScalarValue no longer
contains physical type, so we can only transform Scalar which contains DataType
to Array.

> I think the approach you mentioned are only possible if Scalar exists?

Yes all of the above only applies to a scenario where we have `Scalar` for
the "physical Scalar" and `ScalarValue` for the logical value.

> Do you have any idea that is possible to work on main branch?

I don't think this particular approach works on main as we must have this
distinction between "physical" and "logical" Scalar. Then once we have this
distinction, we would need to migrate stuff like the `Accumulator` example to
`Scalar`. Once that has been done we can start removing the physical types from
`ScalarValue`.

However, I've been experimenting with a few other approaches that might work
in practice on main. These approaches aren't fleshed out but maybe you find
some of them interesting:

### `ColumnarValue` is not necessarily a Physical Object

Some experiments for applying this strategy when removing
`ScalarValue::Utf8View` and `ScalarValue::LargeUtf8` (tests not passing):
https://github.com/tobixdev/datafusion/compare/main...tobixdev:datafusion:alternative-logical-types?expand=1

A `ColumnarValue` is either a physical object (`ArrayRef`) or a logical
object (`ScalarValue`). We can remove variants from `ScalarValue` because it is
a logical value and must not have a `DataType`. However, somehow we must then
retain the type information to support things like casting to arrow types.

One approach to solve this issues is to use `PhysicalExpr` as a point where
we always have a `DataType`. Within a `PhysicalExpr` we would have the ability
to determine the physical type of a `ScalarValue` and `ColumnarValue`. Outside
of a `PhysicalExpr` we have no physical type of a `ScalarValue` and can obtain
the physical type of a `ColumnarValue` if it is an `ArrayRef`.

Some notable changes in the diff that you can look at:
- `datafusion/expr/src/udf.rs`: Add physical argument types to
`ScalarFunctionArgs`
- `datafusion/functions/src/string/bit_length.rs`: An example that can use
the `return_type` for physical type information
- `datafusion/functions/src/core/arrowtypeof.rs`: An example that uses the
physical types of the arguments

Some notes on this approach:
- Once we remove variants from `ScalarValue` we still should alter its API
to always take a `DataType` when creating physical objects as mentioned above.
The API for creating `ScalarValues` should probably be altered too to avoid
confusion (e.g., `ScalarValue::from(DataType::Utf8View)`. However, loosing the
type information is not as critical as in the `Scalar` approach as the
`PhysicalExpr` is responsible for retaining it.
- It shouldn't break that much existing code.
- We also would have to adjust the API from `ColumnarValue` as, for example,
`data_type()` can only return a `Option<DataType>`.
- `ColumnarValue` being something between "physical" and "logical" seems a
bit off.

### Logical Overlays over ScalarValue

In this approach `ScalarValue` (and `ColumnarValue`) would still be a
physical object. Based on this we introduce an API that can extract the logical
values from the physical values. Basically, this is a generalization of
[`try_as_str`](https://github.com/apache/datafusion/pull/14167) from @alamb.

One way of implementing this approach would require to have an enum that
mirrors `NativeType` (maybe `LogicalScalar`?) and adds a value to it. This enum
could also include a variant for extension types. Then we would have one scalar
type that mirrors the physical value and one that mirrors the logical values. I
think having this as an enum is good as it then supports pattern matching
(contrary to just calling `try_as_...`).

One possible API to obtain this could be:
```rust
let scalar_value = ... ;
match scalar_value.into_logical_scalar() {
LogicalScalar::Null => ...
LogicalScalar::Boolean(false) => ...
LogicalScalar::String(str) => ....
}
```

Furthermore, I would also like a generalized version of `try_as_str` like
`try_as_logical<LogicalString>()`. `LogicalString` may be similar to
`ArrowPrimitiveType` (e.g., `UInt64Type`). The idea is just to obtain a logical
value (associated type of `LogicalString`?) from a physical one which could
also be implemented in a trait to allow for extension types.

I cannot really say much about the benefits and drawbacks of this approach
as I haven't experimented in this direction. However, I find it interesting as
it retains compatibility with all approaches that assume `ScalarValue` to be a
physical value. Users basically would opt into better ergonomics (not handling
`Utf8View` and `Option<String>`) with logical values.

I'd like to experiment with the latter solution a bit. I'd also love to hear
your opinions and ideas.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] [EPIC] Decouple logical from physical types [datafusion]

Reply via email to