vbarua commented on issue #13318: URL: https://github.com/apache/datafusion/issues/13318#issuecomment-2465908028
What I'm hoping to drive with the following wall of text is: 1. Agreement that this is worth improving. 2. Feedback and suggestions for possible approaches. # Current Handling State The following documents the current state of handling for the various extension points and discusses some of the issues with them. ## Extension Relations Substrait has 3 relation types: * [ExtensionLeafRel](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L403-L407) * [ExtensionSingleRel](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L396-L401) * [ExtensionMultiRel](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L409C1-L414C2) which consume 0, 1 or more than 1 inputs respectively alongside a Protobuf Any detail message. The current API to decode these extensions re-uses the following method from the [SerializerRegistry](https://github.com/apache/datafusion/blob/main/datafusion/expr/src/registry.rs#L126-L142) trait: ```rust fn deserialize_logical_plan( &self, name: &str, bytes: &[u8], ) -> Result<Arc<dyn UserDefinedLogicalNode>>; ``` The use of this method assumes that the `detail` message maps directly to a `UserDefineLogicalNode`, which isn't always the case. Consider the following message intended for use with `ExtensionSingleRel` ```proto message UnnestDetail { repeated uint32 col_refs = 1; } ``` there isn't enough information in the message alone to produce a valid `UserDefinedLogicalNode`, however if we had access to the input node it becomes relatively straightforward. A simpler API for this would be a hook like: ```rust fn handle_extension_single_rel(e: &ExtensionSingleRel) -> Result<LogicalPlan> ``` to allow users to fully own the deserialization logic. ## Extension Table The Substrait consumer does not provide any hooks to handle extension tables. An API for this could be something like: ```rust fn handle_extension_table(e: &ExtensionTable) -> Result<LogicalPlan> ``` This may be more flexible that is actually needed. ## User-Defined Types The Substrait consumer does not provide any hooks to handle user-defined types. An API for this could be something like. ```rust fn handle_user_defined_type(ud: &UserDefined) -> Result<DataType> ``` ## User-Defined Functions The Substrait consumer decodes Substrait functions based on name of the function in Substrait. The following is part of the handling code: ```rust if let Some(func) = ctx.state().scalar_functions().get(fn_name) { Ok(Expr::ScalarFunction(expr::ScalarFunction::new_udf( func.to_owned(), args, ))) ``` To map a Substrait function to a DataFusion function, users can bind the DataFusion implementation to the same name in the session context. This works reasonable well, but the fact the function resolution ignores the extension file can cause issues. For example, it's perfectly valid to have 2 function with the same name exist in different extensions. The spec defines an integer division function: ```yaml # functions_arithmetic.yaml %YAML 1.2 --- scalar_functions: - name: "divide" impls: - args: - name: x value: i64 - name: y value: i64 return: i64 ``` but we may wish to provide a variant that returns a float as in MySQL: ```yaml # functions_mysql.yaml %YAML 1.2 --- scalar_functions: - name: "divide" - args: - name: x value: i64 - name: y value: i64 return: fp64 ``` However, these cannot be distinguished by just the name. Improving the API for function resolution should probably be its own issue as it entails a fair bit of work. ## Advanced Extensions DataFusion currently ignores Advanced Extensions entirely. Advanced Extensions come in two forms: * optimizations: additional metadata which *does not* influence semantics and are ignorable. * enhancements: additional metadata which *does* influence semantics and cannot be ignored. In my opinion, these are the hardest types of extensions to build around because they contain arbitrary metadata associated with existing relations. This may be applied/needed before the conversion, after the conversion or *during* the conversion. Trying to account for all possible uses of this metadata is quite tricky. In practice, what has worked well for the Java library is to make the code for converting the various components of a relation reusable so that users can fully customize the conversion of the relation as they need. # Consumer API Design The current consumer API starts by invoking the ```rust pub async fn from_substrait_plan( ctx: &SessionContext, plan: &Plan, ) -> Result<LogicalPlan> ... ``` function and then recursively invokes other utility functions like ```rust pub async fn from_substrait_rel( ctx: &SessionContext, rel: &Rel, extensions: &Extensions, ) -> Result<LogicalPlan> ... pub async fn from_substrait_rex( ctx: &SessionContext, e: &Expression, input_schema: &DFSchema, extensions: &Extensions, ) -> Result<Expr> pub async fn from_substrait_sorts( ctx: &SessionContext, substrait_sorts: &Vec<SortField>, input_schema: &DFSchema, extensions: &Extensions, ) -> Result<Vec<Sort>> ... ``` This API threads a SessionContext and Extension context through all the calls, even though they are only needed in a handful of functions. We could potentially add more parameters for the handlers, or even a struct of handlers to pass through, but I think it would be better to refactor the consumer code to be associated with a struct and associate the handlers with that instead. A design I have though about and experimented with is something like: ```rust trait SubstraitConsumer { fn get_context() -> SessionContext; fn get_extensions() -> Extensions; async fn from_substrait_rel(rel: &Rel) -> Result<LogicalPlan> { <default impl> } async fn from_substrait_rex( e: &Expression, input_schema: &DFSchema ) -> Result<LogicalPlan> { <default impl> } async fn from_substrait_sorts( ctx: &SessionContext, substrait_sorts: &Vec<SortField>, input_schema: &DFSchema, extensions: &Extensions, ) -> Result<Vec<Sort>> { <default impl> } ... // user extension async fn from_extension_single_rel(e: &ExtensionSingleRel) -> Result<LogicalPlan> { <default impl which produces an error> } async fn from_user_defined_type(ud: &UserDefined) -> Result<DataType> { <default impl which produces an error> } ... } ``` The default implementations would be sufficient to decode plans with no user-defined extensions, but anyone using custom extensions, types, etc would need to provide their handlings. I'm still exploring other alternatives, but thought it would be worthwhile to start a discussion around this and gather peoples thoughts and feedback. ## API Breakage Concerns As part of this work, I'm expecting a certain amount of API breakage as I don't expect to be able we'll be able to nail the design of this in one go. What we can do though is limit the amount of API churn for folks who only care about *standard* Substrait. For example ```rust pub async fn from_substrait_plan( ctx: &SessionContext, plan: &Plan, ) -> Result<LogicalPlan> ... ``` can still be used as the entry point for most users. We can minimize breakage for these users while we iterate and improve the consumer. Folks who want to use their own extensions will likely experience more breakage as we iterate, but at the moment they can't use them at all. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
