tobixdev commented on issue #12622:
URL: https://github.com/apache/datafusion/issues/12622#issuecomment-2639093257

   > I think getting `Scalar` from `DataType` or `ArrayRef` makes sense, so the 
restriction of this doesn't seem sound to me.
   
   What I meant is that we should not be able to directly create a 
`ScalarValue` from these things as it is easy to accidentally drop the physical 
type. For example, `ScalarValue::from(DataType::Utf8View)` should not be 
possible but `Scalar::from(DataType::Utf8View)` should be possible. As you 
said, one can call `into_value()` if you don't care about the physical type.
   
   > Is it possible to carry Scalar along where we need to know the DataType? 
If not, what might be the case? I guess we need get_value() instead of 
into_value()?
   
   I think we need to change many parts of DataFusion to use a `Scalar` instead 
of a `ScalarValue` if we go with this approach. For example, in `Accumulator` 
(`expr-common/src/accumulator.rs`): 
   - `Accumulator::evaluate` will need to return a `Scalar` instead of a 
`ScalarValue`. Otherwise aggregates cannot evaluate to physical objects.
   - This begs the question wheter `Accumulator::state` should return 
`Result<Vec<Scalar>>` instead of `Result<Vec<ScalarValue>>` (I think so)
   
   I think there are many parts like this in DF and maybe we can only start 
removing `ScalarValue` variants once these parts have been migrated to use 
`Scalar`. Otherwise, we might unintentionally loose type information.
   
   > Getting Array from ScalarValue is valid since physical type is tightly 
coupled by ScalarValue. When we switch to Scalar, ScalarValue no longer 
contains physical type, so we can only transform Scalar which contains DataType 
to Array.
   
   +1 
   
   > I think the approach you mentioned are only possible if Scalar exists?
   
   Yes all of the above only applies to a scenario where we have `Scalar` for 
the "physical Scalar" and `ScalarValue` for the logical value.
   
   > Do you have any idea that is possible to work on main branch?
   
   I don't think this particular approach works on main as we must have this 
distinction between "physical" and "logical" Scalar. Then once we have this 
distinction, we would need to migrate stuff like the `Accumulator` example to 
`Scalar`. Once that has been done we can start removing the physical types from 
`ScalarValue`.
   
   However, I've been experimenting with a few other approaches that might work 
in practice on main. These approaches aren't fleshed out but maybe you find 
some of them interesting:
   
   ### `ColumnarValue` is not necessarily a Physical Object
   
   Some experiments for applying this strategy when removing 
`ScalarValue::Utf8View` and `ScalarValue::LargeUtf8` (tests not passing): 
https://github.com/tobixdev/datafusion/compare/main...tobixdev:datafusion:alternative-logical-types?expand=1
   
   A `ColumnarValue` is either a physical object (`ArrayRef`) or a logical 
object (`ScalarValue`). We can remove variants from `ScalarValue` because it is 
a logical value and must not have a `DataType`. However, somehow we must then 
retain the type information to support things like casting to arrow types.
   
   One approach to solve this issues is to use `PhysicalExpr` as a point where 
we always have a `DataType`. Within a `PhysicalExpr` we would have the ability 
to determine the physical type of a `ScalarValue` and `ColumnarValue`. Outside 
of a `PhysicalExpr` we have no physical type of a `ScalarValue` and can obtain 
the physical type of a `ColumnarValue` if it is an `ArrayRef`.
    
   Some notable changes in the diff that you can look at:
   - `datafusion/expr/src/udf.rs`:  Add physical argument types to 
`ScalarFunctionArgs`
   - `datafusion/functions/src/string/bit_length.rs`: An example that can use 
the `return_type` for physical type information
   - `datafusion/functions/src/core/arrowtypeof.rs`: An example that uses the 
physical types of the arguments 
   
   Some notes on this approach:
   - Once we remove variants from `ScalarValue` we still should alter its API 
to always take a `DataType` when creating physical objects as mentioned above. 
The API for creating `ScalarValues` should probably be altered too to avoid 
confusion (e.g., `ScalarValue::from(DataType::Utf8View)`. However, loosing the 
type information is not as critical as in the `Scalar` approach as the 
`PhysicalExpr` is responsible for retaining it.
   - It shouldn't break that much existing code.
   - We also would have to adjust the API from `ColumnarValue` as, for example, 
`data_type()` can only return a `Option<DataType>`.
   - `ColumnarValue` being something between "physical" and "logical" seems a 
bit off.
   
   ### Logical Overlays over ScalarValue
   
   In this approach `ScalarValue` (and `ColumnarValue`) would still be a 
physical object. Based on this we introduce an API that can extract the logical 
values from the physical values. Basically, this is a generalization of 
[`try_as_str`](https://github.com/apache/datafusion/pull/14167) from @alamb.
   
   One way of implementing this approach would require to have an enum that 
mirrors `NativeType` (maybe `LogicalScalar`?) and adds a value to it. This enum 
could also include a variant for extension types. Then we would have one scalar 
type that mirrors the physical value and one that mirrors the logical values. I 
think having this as an enum is good as it then supports pattern matching 
(contrary to just calling `try_as_...`).
   
   One possible API to obtain this  could be:
   ```rust
   let scalar_value = ... ;
   match scalar_value.into_logical_scalar() {
       LogicalScalar::Null => ...
       LogicalScalar::Boolean(false) => ...
       LogicalScalar::String(str) => ....
   }
   ```
   
   Furthermore, I would also like a generalized version of `try_as_str` like 
`try_as_logical<LogicalString>()`. `LogicalString` may be similar to 
`ArrowPrimitiveType` (e.g., `UInt64Type`). The idea is just to obtain a logical 
value (associated type of `LogicalString`?) from a physical one which could 
also be implemented in a trait to allow for extension types.
   
   I cannot really say much about the benefits and drawbacks of this approach 
as I haven't experimented in this direction. However, I find it interesting as 
it retains compatibility with all approaches that assume `ScalarValue` to be a 
physical value. Users basically would opt into better ergonomics (not handling 
`Utf8View` and `Option<String>`) with logical values.
   
   I'd like to experiment with the latter solution a bit. I'd also love to hear 
your opinions and ideas.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to