tobixdev commented on issue #12622: URL: https://github.com/apache/datafusion/issues/12622#issuecomment-2598092150
Merging the branch with the upstream is quite a project as there are many merge conflicts and a lot of incompatible code was created in the meantime (on the other hand, it's good to see so much progress on DataFusion). Before investing more time in this, I'd like to start another discussion about whether this is the solution we are going for (kind of like a result from adapting the new code from main). As I understand it, the new [`Scalar` struct](https://github.com/apache/datafusion/blob/logical-types/datafusion/expr-common/src/scalar.rs) is only necessary for carrying a physical data type for a logical value so that we can support operations like `arrow_cast`. Unfortunately, using `Scalar` in `ColumnarValue` and `Expr` breaks all patterns that try to match the scalar for these types. While we can fix this in the DataFusion code base (we need a nested match or smth similar in many cases), all downstream dependencies must rewrite their pattern-matching code, which can be (I assume) a huge undertaking for many projects. Here is an [example](https://github.com/apache/datafusion/commit/454db7e63c6d6af1c8975f29ce9b7ee9c37ea287#diff-d4a05272894f059368463815799d47bc9c33fd3f861ffcea827839f71938846fR111) from the current version of `logical-types`. Even this may be acceptable if the resulting solution is what we want, and it will remain so for the foreseeable future when considering the possible impact of logical types. However, being unable to match scalar types and values will undoubtedly impact ergonomics for some use cases. Furthermore, if we ever conclude that we do not need the physical type (e.g., because we can derive the physical type from the schema), these breaking changes could have been avoided. Therefore, are we sure that we require the `Scalar` type? If yes, I can proceed with updating the PR. If not, maybe we should try to infer the necessary physical types from the schema to avoid this large breaking change. I also [experimented]https://github.com/apache/datafusion/commit/527087fbbbd04465605a3de5cebeee54e50c846c#diff-49e275af8f09685c7bbc491db8ab3b9479960878f42ac558ec0e3e39570590bdR298) with adding a new variant `WithPhysicalType` to `ScalarValue`. This works great regarding breaking changes as most match statements only consider the set of "supported" scalars. Unfortunately, it allows users to miss "logically equivalent" values with a different physical type when they only match against the regular variants. So I also think this is not an adequate solution (e.g., `ScalarValue::WithPhysicalType(Box::new(ScalarValue::Utf8("abc"), DataType::Utf8View)` will not be matched by `ScalarValue::Utf8("abc")`. However, it may be a less invasive transitional vehicle to carry the physical data type until we can infer the data type from the schema (and update optimizations etc.) to respect this. > Since the logical-types branch can easily diverge from the main branch, even when the sub-tasks are incomplete, would it be better to merge it into the main branch frequently and continue evolving it as new ideas emerge? Adopting this strategy would mean that we cannot release logical types until we can get rid of the `WithPhysicalType` variant, which is unfortunate. However, not breaking the pattern-matching capabilities of `ColumnarValue` and `Expr` may be worth the effort, and synching the main branch should be more straightforward as we also do not have that many breaking changes. cc @notfilippo @findepi @alamb @jayzhan211 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org