westonpace opened a new issue, #12181: URL: https://github.com/apache/datafusion/issues/12181
### Is your feature request related to a problem or challenge?

I am trying to take a filter expression created by pyarrow and convert it into a filter expression for DataFusion to satisfy. I am using Substrait to do this. Everything works fine when I use the standard Substrait types. However, when I use normal Arrow types that are not Substrait types (e.g. unsigned integers, large containers) I run into problems.

It seems that arrow-cpp (admittedly, me, in this case) and DataFusion have taken different approaches to handling these limitations. In arrow-cpp, the types that expand or change the valid range of values (e.g. unsigned integers, large containers) are converted to extension types. This process is documented in https://github.com/apache/arrow/blob/main/format/substrait/extension_types.yaml

In DataFusion, it appears these types are expected to use the nearest Substrait match (e.g. signed integer, small container) with a type variation.

### Describe the solution you'd like

I am admittedly biased (given I implemented one of the two disagreeing components) but I favor the extension types approach. Substrait defines type variations as follows:

> Type variations may be used to represent differences in representation between different consumers. For example, an engine might support dictionary encoding for a string, or could be using either a row-wise or columnar representation of a struct. All variations of a type are expected to have the same semantics when operated on by functions or other expressions.

Given that definition, I do not think it is valid to say that an unsigned integer is a variation of a signed integer (they do not have the same outputs for all functions). I do believe things like the view types and dictionary encoding are valid type variations.

### Describe alternatives you've considered

The alternative would be to change arrow-cpp to also use type variations.
However, I would like some consensus from the Substrait community that this is a valid use of type variations before taking that approach.

At the moment I am working around this issue by simply removing any non-standard types from the input schema (this works as long as the filter isn't referencing those types).

### Additional context

_No response_
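
As a rough illustration of the claim that signed and unsigned integers do not have the same outputs for all functions, here is a minimal sketch in plain Python. It assumes both types share the same 8-bit storage; the helper names `as_unsigned` and `as_signed` are hypothetical and are not part of Substrait, Arrow, or DataFusion.

```python
def as_unsigned(bits: int, width: int = 8) -> int:
    """Interpret the low `width` bits as an unsigned integer."""
    return bits & ((1 << width) - 1)


def as_signed(bits: int, width: int = 8) -> int:
    """Interpret the low `width` bits as a two's-complement signed integer."""
    value = as_unsigned(bits, width)
    sign_bit = 1 << (width - 1)
    # If the sign bit is set, the value is negative in two's complement.
    return value - (1 << width) if value & sign_bit else value


# Two stored bit patterns (hypothetical column values).
a, b = 0xC8, 0x64

# Read as uint8 the values are 200 and 100; read as int8 they are -56 and 100.
unsigned_result = as_unsigned(a) > as_unsigned(b)
signed_result = as_signed(a) > as_signed(b)

# The same "greater than" expression gives different answers,
# so the two types do not share semantics under this function.
print(unsigned_result, signed_result)
```

A dictionary-encoded string, by contrast, decodes to the same logical values as a plain string, so every function returns the same result either way; that is the sense in which it fits the type variation definition quoted above.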
