tobixdev commented on issue #12622:
URL: https://github.com/apache/datafusion/issues/12622#issuecomment-2637394367

   I've been trying to tackle the next issues in task (e.g., removing 
`Utf8View` and `Dictionary`). However, removing them introduces subtle bugs 
(e.g., trying to compare a dictionary array with a string array causing an 
error) that basically result from the pattern:
   
   1. Convert a physical object (e.g., ArrowArray) or a `DataType` to a 
`ScalarValue`
   2. Convert this `ScalarValue` into to a physical object (or call 
`scalar_value.data_type()`)
   
   An example would be `ScalarValue::try_from(DataType::Utf8View)?.to_array()`
   
   We can do one of two things in `TryFrom<DataType> for ScalarValue` once 
`Utf8View` has been removed:
   * Return an error for `DataType::Utf8View`. 
   * Return a `ScalarValue::Utf8`
   
   The problem with both approaches is that, once a test case fails, all 
debugging happens at runtime. I've been wondering if we could improve this by 
shifting the burden to compile time by changing the API of `Scalar` and 
`ScalarValue` before removing the problematic `ScalarValue` variants.
   
   Here are some thoughts:
   - There should be no function for creating a `ScalarValue` from a physical 
object or `DataType`. If this restriction is not imposed, it is easy to 
accidentally drop the physical type once we start removing variants. We would 
move these functionality to `Scalar`. One can still obtain a `ScalarValue` by 
calling `Scalar::try_from(DataType::Utf8View).into_value()`. However, the 
potential loss of the physical type is now explicit by calling `into_value()`.
   - There should be no method for creating a physical object from a 
`ScalarValue` without a `DataType`. If this restriction is not imposed, it is 
easy to create physical objects with an unexpected type. Methods like 
`to_array` would get a new `DataType` parameter that helps to avoid creating 
physical objects with an unexpected type. Methods on `Scalar` would not require 
this parameter as the data type is stored explicitly. Another approach would be 
to rename the current versions of `to_array` in `ScalarValue` to 
`to_canonical_array` (or smth. similar) and only have `to_array` in `Scalar`. 
This should prevent things like `Scalar{ value: ScalarValue::Utf8("test"), 
data_type: DataType::Utf8View}.into_value().to_array()` (of course in a more 
subtle scenario).
   
   I think creating these two restrictions would greatly reduce the problems 
with unexpected types at runtime in datafusion and downstream dependencies. We 
can also iterate with this process as we move more functionality from 
`ScalarValue` to `Scalar` and hopefully removing variants from `ScalarValue` 
will become easier after adapting the API. No functionality would get lost, we 
would just distribute it between `Scalar` and `ScalarValue` before removing the 
variants.
   
   LMK what you think.
   
   cc @jayzhan211


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to