ccciudatu commented on PR #13772: URL: https://github.com/apache/datafusion/pull/13772#issuecomment-2592737482
Thanks, @vbarua. And sorry about the delay. > @ccciudatu do you have an example of any standard and/or built-in DataFusion functionality that can't be serialised to Substrait without this boilerplate? No, I don't think any standard/built-in component is ever supposed to end up as an `ExtensionTable` or `ExtensionRel` of any kind. However, since we do have explicit support for `UserDefinedNodeType`s, it kind of makes sense to add support for user-defined tables as well. Trying to represent custom tables as `ExtensionLeafRel`s in Substrait and `UserDefinedNodeType`s has several limitations that I hinted at in the associated issue, e.g. the table context information gets lost in an `ExtensionLeafRel` as opposed to a `ReadRel`, `ExtensionPlanner::plan_extension` is not a suitable wrapper around `TableProvider::scan`, both producers and consumers need to pre/post-process the logical plans before/after conversion etc. > Is it primarily or only UserDefineTypeRels? Yes, this is only meant to address user-defined / custom `TableProvider` implementations, and only the ones that cannot be restored by name alone -- otherwise a `NamedTable` read_type would suffice. > I'm realising that I don't have a good idea of what this would actually be needed for in standard DataFusion. If there is existing functionality that needs this kind of handling to work, I think it makes sense to include it in the default consumer and producer. Again, I'm not sure if you consider extension points like `UserDefinedLogicalNode` and `TableProvider` as part of "standard DataFusion". However, I see them as equally legit, i.e. if there's room for the first in the default producer/consumer, the latter should also be achievable with just a custom serializer. > Most of my experience with ExtensionTables in Substrait (and hence biases) comes from generating plans outside of DataFusion where we have table-like sources, like embedding a SQL query for another system or fetching data from an internal cache store, which aren't part of standard Substrait (and are also highly-specific to our use case and deployments). For stuff like this we define our own handling because we don't expect DataFusion to be able to handle our custom messages. One use-case that I'm trying to address with this is a UDTF that integrates a legacy OLAP system that we're using as a data source for both DataFusion and Spark, using Substrait to federate the query plan as we see fit. An SQL query like `select * from legacy_olap_system(arg1, arg2, ...)` results in a `TableScan` node that references a custom `TableProvider` implementation and has the name "tmp_table" assigned by DataFusion. Encoding/decoding such a plan to/from Substrait with the current producer/consumer results in the equivalent of `select * from tmp_table` and errors out with a "table not found" message. With the code in this patch and a custom `SerializerRegistry`, such a custom table will encode its internal state as proto and will get fully restored in a different DataFusion context (or a different system altogether). > > a custom TableProvider implementation is hardly aware of the names under which users choose to register the corresponding tables (especially for UDTFs) > > That's somewhat my argument for not supporting custom stuff generally. The only people who can know what names they used are the ones generating the plans and wiring in the UDTFs in the first place. As long as they have the hooks in place to wire things up how they want, we can service their needs. I agree this doesn't feel exactly right. The Substrait spec for extension tables is quite poorly defined, so to recreate the exact same plan (including names) on roundtrips we do require either small hacks or a more complex interface. However, this is not a hard requirement and the code accepts any compliant input. I can even remove this hack for now and wait/request for enhancements in the Substrait spec, if this is too ugly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org