ccciudatu commented on PR #13772:
URL: https://github.com/apache/datafusion/pull/13772#issuecomment-2592737482

   Thanks, @vbarua. And sorry about the delay.
   
   
   > @ccciudatu do you have an example of any standard and/or built-in 
DataFusion functionality that can't be serialised to Substrait without this 
boilerplate?
   
   No, I don't think any standard/built-in component is ever supposed to end up 
as an `ExtensionTable` or `ExtensionRel` of any kind.
   However, since we do have explicit support for `UserDefinedNodeType`s, it 
kind of makes sense to add support for user-defined tables as well. Trying to 
represent custom tables as `ExtensionLeafRel`s in Substrait and 
`UserDefinedNodeType`s has several limitations that I hinted at in the 
associated issue, e.g. the table context information gets lost in an 
`ExtensionLeafRel` as opposed to a `ReadRel`, 
`ExtensionPlanner::plan_extension` is not a suitable wrapper around 
`TableProvider::scan`, both producers and consumers need to pre/post-process 
the logical plans before/after conversion etc.
   
   > Is it primarily or only UserDefineTypeRels?
   
   Yes, this is only meant to address user-defined / custom `TableProvider` 
implementations, and only the ones that cannot be restored by name alone -- 
otherwise a `NamedTable` read_type would suffice.
   
   
   > I'm realising that I don't have a good idea of what this would actually be 
needed for in standard DataFusion. If there is existing functionality that 
needs this kind of handling to work, I think it makes sense to include it in 
the default consumer and producer.
   
   Again, I'm not sure if you consider extension points like 
`UserDefinedLogicalNode` and `TableProvider` as part of "standard DataFusion".  
However, I see them as equally legit, i.e. if there's room for the first in the 
default producer/consumer, the latter should also be achievable with just a 
custom serializer.
   
   > Most of my experience with ExtensionTables in Substrait (and hence biases) 
comes from generating plans outside of DataFusion where we have table-like 
sources, like embedding a SQL query for another system or fetching data from an 
internal cache store, which aren't part of standard Substrait (and are also 
highly-specific to our use case and deployments). For stuff like this we define 
our own handling because we don't expect DataFusion to be able to handle our 
custom messages.
   
   One use-case that I'm trying to address with this is a UDTF that integrates 
a legacy OLAP system that we're using as a data source for both DataFusion and 
Spark, using Substrait to federate the query plan as we see fit.
   
   An SQL query like `select * from legacy_olap_system(arg1, arg2, ...)` 
results in a `TableScan` node that references a custom `TableProvider` 
implementation and has the name "tmp_table" assigned by DataFusion.
   Encoding/decoding such a plan to/from Substrait with the current 
producer/consumer results in the equivalent of `select * from tmp_table` and 
errors out with a "table not found" message.
   
   With the code in this patch and a custom `SerializerRegistry`, such a custom 
table will encode its internal state as proto and will get fully restored in a 
different DataFusion context (or a different system altogether).
   
   > > a custom TableProvider implementation is hardly aware of the names under 
which users choose to register the corresponding tables (especially for UDTFs)
   > 
   > That's somewhat my argument for not supporting custom stuff generally. The 
only people who can know what names they used are the ones generating the plans 
and wiring in the UDTFs in the first place. As long as they have the hooks in 
place to wire things up how they want, we can service their needs.
   
   I agree this doesn't feel exactly right. The Substrait spec for extension 
tables is quite poorly defined, so to recreate the exact same plan (including 
names) on roundtrips we do require either small hacks or a more complex 
interface.
   However, this is not a hard requirement and the code accepts any compliant 
input. I can even remove this hack for now and wait/request for enhancements in 
the Substrait spec, if this is too ugly.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to