tobixdev commented on issue #12622: URL: https://github.com/apache/datafusion/issues/12622#issuecomment-2639093257
> I think getting `Scalar` from `DataType` or `ArrayRef` makes sense, so the restriction of this doesn't seem sound to me. What I meant is that we should not be able to directly create a `ScalarValue` from these things as it is easy to accidentally drop the physical type. For example, `ScalarValue::from(DataType::Utf8View)` should not be possible but `Scalar::from(DataType::Utf8View)` should be possible. As you said, one can call `into_value()` if you don't care about the physical type. > Is it possible to carry Scalar along where we need to know the DataType? If not, what might be the case? I guess we need get_value() instead of into_value()? I think we need to change many parts of DataFusion to use a `Scalar` instead of a `ScalarValue` if we go with this approach. For example, in `Accumulator` (`expr-common/src/accumulator.rs`): - `Accumulator::evaluate` will need to return a `Scalar` instead of a `ScalarValue`. Otherwise aggregates cannot evaluate to physical objects. - This begs the question wheter `Accumulator::state` should return `Result<Vec<Scalar>>` instead of `Result<Vec<ScalarValue>>` (I think so) I think there are many parts like this in DF and maybe we can only start removing `ScalarValue` variants once these parts have been migrated to use `Scalar`. Otherwise, we might unintentionally loose type information. > Getting Array from ScalarValue is valid since physical type is tightly coupled by ScalarValue. When we switch to Scalar, ScalarValue no longer contains physical type, so we can only transform Scalar which contains DataType to Array. +1 > I think the approach you mentioned are only possible if Scalar exists? Yes all of the above only applies to a scenario where we have `Scalar` for the "physical Scalar" and `ScalarValue` for the logical value. > Do you have any idea that is possible to work on main branch? I don't think this particular approach works on main as we must have this distinction between "physical" and "logical" Scalar. Then once we have this distinction, we would need to migrate stuff like the `Accumulator` example to `Scalar`. Once that has been done we can start removing the physical types from `ScalarValue`. However, I've been experimenting with a few other approaches that might work in practice on main. These approaches aren't fleshed out but maybe you find some of them interesting: ### `ColumnarValue` is not necessarily a Physical Object Some experiments for applying this strategy when removing `ScalarValue::Utf8View` and `ScalarValue::LargeUtf8` (tests not passing): https://github.com/tobixdev/datafusion/compare/main...tobixdev:datafusion:alternative-logical-types?expand=1 A `ColumnarValue` is either a physical object (`ArrayRef`) or a logical object (`ScalarValue`). We can remove variants from `ScalarValue` because it is a logical value and must not have a `DataType`. However, somehow we must then retain the type information to support things like casting to arrow types. One approach to solve this issues is to use `PhysicalExpr` as a point where we always have a `DataType`. Within a `PhysicalExpr` we would have the ability to determine the physical type of a `ScalarValue` and `ColumnarValue`. Outside of a `PhysicalExpr` we have no physical type of a `ScalarValue` and can obtain the physical type of a `ColumnarValue` if it is an `ArrayRef`. Some notable changes in the diff that you can look at: - `datafusion/expr/src/udf.rs`: Add physical argument types to `ScalarFunctionArgs` - `datafusion/functions/src/string/bit_length.rs`: An example that can use the `return_type` for physical type information - `datafusion/functions/src/core/arrowtypeof.rs`: An example that uses the physical types of the arguments Some notes on this approach: - Once we remove variants from `ScalarValue` we still should alter its API to always take a `DataType` when creating physical objects as mentioned above. The API for creating `ScalarValues` should probably be altered too to avoid confusion (e.g., `ScalarValue::from(DataType::Utf8View)`. However, loosing the type information is not as critical as in the `Scalar` approach as the `PhysicalExpr` is responsible for retaining it. - It shouldn't break that much existing code. - We also would have to adjust the API from `ColumnarValue` as, for example, `data_type()` can only return a `Option<DataType>`. - `ColumnarValue` being something between "physical" and "logical" seems a bit off. ### Logical Overlays over ScalarValue In this approach `ScalarValue` (and `ColumnarValue`) would still be a physical object. Based on this we introduce an API that can extract the logical values from the physical values. Basically, this is a generalization of [`try_as_str`](https://github.com/apache/datafusion/pull/14167) from @alamb. One way of implementing this approach would require to have an enum that mirrors `NativeType` (maybe `LogicalScalar`?) and adds a value to it. This enum could also include a variant for extension types. Then we would have one scalar type that mirrors the physical value and one that mirrors the logical values. I think having this as an enum is good as it then supports pattern matching (contrary to just calling `try_as_...`). One possible API to obtain this could be: ```rust let scalar_value = ... ; match scalar_value.into_logical_scalar() { LogicalScalar::Null => ... LogicalScalar::Boolean(false) => ... LogicalScalar::String(str) => .... } ``` Furthermore, I would also like a generalized version of `try_as_str` like `try_as_logical<LogicalString>()`. `LogicalString` may be similar to `ArrowPrimitiveType` (e.g., `UInt64Type`). The idea is just to obtain a logical value (associated type of `LogicalString`?) from a physical one which could also be implemented in a trait to allow for extension types. I cannot really say much about the benefits and drawbacks of this approach as I haven't experimented in this direction. However, I find it interesting as it retains compatibility with all approaches that assume `ScalarValue` to be a physical value. Users basically would opt into better ergonomics (not handling `Utf8View` and `Option<String>`) with logical values. I'd like to experiment with the latter solution a bit. I'd also love to hear your opinions and ideas. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org