Re: [I] [substrait] improve handling for user-defined functionality [datafusion]

via GitHub Fri, 08 Nov 2024 15:45:45 -0800


vbarua commented on issue #13318:
URL: https://github.com/apache/datafusion/issues/13318#issuecomment-2465908028


   
   What I'm hoping to drive with the following wall of text is:
   1. Agreement that this is worth improving.
   2. Feedback and suggestions for possible approaches.
   
   # Current Handling State
   The following documents the current state of handling for the various 
extension points and discusses some of the issues with them.
   
   ## Extension Relations
   Substrait has 3 relation types:
   * 
[ExtensionLeafRel](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L403-L407)
   * 
[ExtensionSingleRel](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L396-L401)
   * 
[ExtensionMultiRel](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L409C1-L414C2)
   
   which consume 0, 1 or more than 1 inputs respectively alongside a Protobuf 
Any detail message.
   
   The current API to decode these extensions re-uses the following method from 
the 
[SerializerRegistry](https://github.com/apache/datafusion/blob/main/datafusion/expr/src/registry.rs#L126-L142)
 trait:
   
   ```rust
   fn deserialize_logical_plan(
       &self,
       name: &str,
       bytes: &[u8],
   ) -> Result<Arc<dyn UserDefinedLogicalNode>>;
   ```
   The use of this method assumes that the `detail` message maps directly to a 
`UserDefineLogicalNode`, which isn't always the case. Consider the following 
message intended for use with `ExtensionSingleRel`
   
   ```proto
   message UnnestDetail {
     repeated uint32 col_refs = 1;
   }
   ```
   there isn't enough information in the message alone to produce a valid 
`UserDefinedLogicalNode`, however if we had access to the input node it becomes 
relatively straightforward.
   
   A simpler API for this would be a hook like:
   ```rust
   fn handle_extension_single_rel(e: &ExtensionSingleRel) -> Result<LogicalPlan>
   ```
   to allow users to fully own the deserialization logic.
   
   ## Extension Table
   The Substrait consumer does not provide any hooks to handle extension 
tables. An API for this could be something like:
   ```rust
   fn handle_extension_table(e: &ExtensionTable) -> Result<LogicalPlan>
   ```
   
   This may be more flexible that is actually needed.
   
   ## User-Defined Types
   The Substrait consumer does not provide any hooks to handle user-defined 
types. An API for this could be something like.
   ```rust
   fn handle_user_defined_type(ud: &UserDefined) -> Result<DataType>
   ```
   
   ## User-Defined Functions
   The Substrait consumer decodes Substrait functions based on name of the 
function in Substrait. The following is part of the handling code:
   ```rust
       if let Some(func) = ctx.state().scalar_functions().get(fn_name) {
           Ok(Expr::ScalarFunction(expr::ScalarFunction::new_udf(
               func.to_owned(),
               args,
           )))
   ```
   
   To map a Substrait function to a DataFusion function, users can bind the 
DataFusion implementation to the same name in the session context. This works 
reasonable well, but the fact the function resolution ignores the extension 
file can cause issues.
   
   For example, it's perfectly valid to have 2 function with the same name 
exist in different extensions. The spec defines an integer division function:
   ```yaml
   # functions_arithmetic.yaml
   %YAML 1.2
   ---
   scalar_functions:
     -
       name: "divide"
       impls:
         - args:
             - name: x
               value: i64
             - name: y
               value: i64
           return: i64
   ```
   but we may wish to provide a variant that returns a float as in MySQL:
   ```yaml
   # functions_mysql.yaml
   %YAML 1.2
   ---
   scalar_functions:
     -
       name: "divide"
         - args:
             - name: x
               value: i64
             - name: y
               value: i64
           return: fp64
   ```
   However, these cannot be distinguished by just the name.
   
   Improving the API for function resolution should probably be its own issue 
as it entails a fair bit of work.
   
   ## Advanced Extensions
   DataFusion currently ignores Advanced Extensions entirely. Advanced 
Extensions come in two forms:
   * optimizations: additional metadata which *does not* influence semantics 
and are ignorable.
   * enhancements: additional metadata which *does* influence semantics and 
cannot be ignored.
   
   In my opinion, these are the hardest types of extensions to build around 
because they contain arbitrary metadata associated with existing relations. 
This may be applied/needed before the conversion, after the conversion or 
*during* the conversion. Trying to account for all possible uses of this 
metadata is quite tricky.
   
   In practice, what has worked well for the Java library is to make the code 
for converting the various components of a relation reusable so that users can 
fully customize the conversion of the relation as they need.
   
   
   # Consumer API Design
   The current consumer API starts by invoking the 
   ```rust
   pub async fn from_substrait_plan(
       ctx: &SessionContext,
       plan: &Plan,
   ) -> Result<LogicalPlan> ...
   ```
   function and then recursively invokes other utility functions like
   ```rust
   pub async fn from_substrait_rel(
       ctx: &SessionContext,
       rel: &Rel,
       extensions: &Extensions,
   ) -> Result<LogicalPlan> ...
   
   pub async fn from_substrait_rex(
       ctx: &SessionContext,
       e: &Expression,
       input_schema: &DFSchema,
       extensions: &Extensions,
   ) -> Result<Expr>
   
   pub async fn from_substrait_sorts(
       ctx: &SessionContext,
       substrait_sorts: &Vec<SortField>,
       input_schema: &DFSchema,
       extensions: &Extensions,
   ) -> Result<Vec<Sort>> ...
   ```
   This API threads a SessionContext and Extension context through all the 
calls, even though they are only needed in a handful of functions.
   
   We could potentially add more parameters for the handlers, or even a struct 
of handlers to pass through, but I think it would be better to refactor the 
consumer code to be associated with a struct and associate the handlers with 
that instead.
   
   A design I have though about and experimented with is something like:
   
   ```rust
   trait SubstraitConsumer {
       fn get_context() -> SessionContext;
       fn get_extensions() -> Extensions;
       
       async fn from_substrait_rel(rel: &Rel) -> Result<LogicalPlan> {
           <default impl>
       }
   
       async fn from_substrait_rex(
           e: &Expression,
           input_schema: &DFSchema
       ) -> Result<LogicalPlan> {
           <default impl>
       }
   
       async fn from_substrait_sorts(
           ctx: &SessionContext,
           substrait_sorts: &Vec<SortField>,
           input_schema: &DFSchema,
           extensions: &Extensions,
       ) -> Result<Vec<Sort>> {
           <default impl>
       }
       ...
   
       // user extension
       async fn from_extension_single_rel(e: &ExtensionSingleRel) -> 
Result<LogicalPlan> {
           <default impl which produces an error>
       }
       async fn from_user_defined_type(ud: &UserDefined) -> Result<DataType> {
           <default impl which produces an error>
       }
       ...
   }
   ```
   
   The default implementations would be sufficient to decode plans with no 
user-defined extensions, but anyone using custom extensions, types, etc would 
need to provide their handlings.
   
   I'm still exploring other alternatives, but thought it would be worthwhile 
to start a discussion around this and gather peoples thoughts and feedback.
   
   ## API Breakage Concerns
   As part of this work, I'm expecting a certain amount of API breakage as I 
don't expect to be able we'll be able to nail the design of this in one go.
   
   What we can do though is limit the amount of API churn for folks who only 
care about *standard* Substrait. For example 
   ```rust
   pub async fn from_substrait_plan(
       ctx: &SessionContext,
       plan: &Plan,
   ) -> Result<LogicalPlan> ...
   ```
   can still be used as the entry point for most users.
   
   We can minimize breakage for these users while we iterate and improve the 
consumer. Folks who want to use their own extensions will likely experience 
more breakage as we iterate, but at the moment they can't use them at all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] [substrait] improve handling for user-defined functionality [datafusion]

Reply via email to