timsaucer commented on issue #1181:
URL: https://github.com/apache/datafusion-python/issues/1181#issuecomment-3065925816

   I've been spending some time today looking at this. I'm trying to capture the 
exact use case and how it might work in the implementation.
   
   Suppose we want to create a library `IcebergDF`. This library will be used 
in both Rust and Python implementations. Within this library we have 
`IcebergTableProvider` which is a struct that implements `TableProvider`. We 
also have `IcebergLogicalCodec` which implements `LogicalExtensionCodec`. For 
the codec, the only two methods it supports are `try_encode_table_provider` and 
`try_decode_table_provider`.
   
   We have created a logical plan in `datafusion-python` that includes a table 
from an `IcebergTableProvider` instance. It was created with something along the 
lines of:
   
   ```python
   my_iceberg_table = IcebergTable(somevars)
   ctx.register_table("my_table", my_iceberg_table)
   df = ctx.table("my_table")
   plan = df.logical_plan()
   ```
   
   In this snippet we want to call `plan.to_proto` to get serialized plan 
bytes that we can send along to some remote parts of our larger system.
   
   What we want is for the `to_proto` method to have an updated signature, 
something like:
   
   ```python
   def to_proto(self, *codecs: LogicalExtensionCodec) -> bytes:
   ```
   
   In this case we would pass `plan.to_proto(my_codec)`, where `my_codec` is an 
instance of `IcebergLogicalCodec`.
   
   `IcebergLogicalCodec` is a Python object that has a function 
`__datafusion_logical_extension_codec__` that returns a PyCapsule wrapping 
`FFI_LogicalExtensionCodec`. Now we run into our first problem: even though 
we are passing an Iceberg table provider created *within the same library* as 
this extension codec, the codec cannot tell. We have passed the FFI table 
provider object back and forth from the Iceberg library to df-python and back.
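   For concreteness, here is a minimal sketch of what that protocol method could 
look like on the Python side. Everything here is an assumption for illustration: 
the dunder name follows the pattern above, but the capsule name string and the 
placeholder pointer are made up. In a real implementation the capsule would be 
produced on the Rust side (via pyo3) and wrap an actual 
`FFI_LogicalExtensionCodec` struct.

```python
import ctypes

# Assumed capsule name; kept at module scope so the bytes stay alive for
# the lifetime of any capsule that references it.
_CAPSULE_NAME = b"datafusion_logical_extension_codec"

_PyCapsule_New = ctypes.pythonapi.PyCapsule_New
_PyCapsule_New.restype = ctypes.py_object
_PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]


class IcebergLogicalCodec:
    """Illustrative codec exposing the PyCapsule protocol method."""

    def __datafusion_logical_extension_codec__(self):
        # Placeholder pointer (never dereferenced here); a real codec would
        # export the address of a Rust-allocated FFI_LogicalExtensionCodec.
        return _PyCapsule_New(ctypes.c_void_p(1), _CAPSULE_NAME, None)
```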
   
   I have thought of two approaches, but am open to others:
   
   - My first thought was to add `try_encode` to `FFI_TableProvider` and also 
`ForeignTableProvider`. Then the `IcebergLogicalCodec` can downcast to 
`ForeignTableProvider` and call through those methods to `try_encode`. This would 
then require that `IcebergTableProvider` *also* implement a trait letting us 
know it can do the encoding. This seems *very* round-about and easy to 
implement incorrectly.
   - My second thought, prompted by an LLM suggestion, was to add an optional 
magic number to `FFI_TableProvider`. Then `IcebergLogicalCodec` can check 
whether the magic number matches, indicating that we have been passed 
back a pointer from the same providing library. *If* that magic number matches, 
then we know it is safe for `IcebergLogicalCodec` to access the inner 
`Arc<dyn TableProvider>` directly.
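   The magic-number idea can be sketched roughly as follows. The field names, the 
constant, and the helper function are all hypothetical stand-ins, not the real 
`datafusion-ffi` struct layout; the point is only the check-before-downcast flow.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical library-specific constant chosen by the providing library
# (IcebergDF in this example).
ICEBERG_DF_MAGIC = 0x1CEB_E4C0


@dataclass
class FFI_TableProvider:
    """Stand-in for the FFI struct; `inner` is an opaque pointer in reality."""
    inner: Any
    magic: Optional[int] = None  # None for providers that predate the field


def try_downcast_iceberg(provider: FFI_TableProvider):
    """Return the inner provider only if it came from this same library."""
    if provider.magic == ICEBERG_DF_MAGIC:
        # Safe: the pointer round-tripped through df-python, but it was
        # created by us, so we know its concrete type.
        return provider.inner
    return None
```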
   
   Supposing one of those two approaches works, we should then be at the point 
where we can encode our table provider for later processing.
   
   Now for the second half: we want to decode those bytes in either another 
Python workflow or in a Rust workflow. Focusing on the Python side, since I 
*think* the Rust side would be even more straightforward, we want to make a 
corresponding update to our `from_proto` signature.
   
   ```python
   @staticmethod
   def from_proto(ctx: SessionContext, data: bytes, 
*codecs: LogicalExtensionCodec) -> LogicalPlan:
   ```
   
   In this case `IcebergLogicalCodec` already knows about 
`IcebergTableProvider`, since it is specifically designed to encode/decode these 
table providers. In our datafusion-python implementation of `from_proto` on the 
Rust side, we access the codec(s) via FFI. Then we iterate through them until 
one is successful. If none are successful, we try the default codec. Should 
they all fail, we return an error.
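   The fallback chain described above can be sketched like this. The 
`try_decode` method and `CodecError` are illustrative names for the purposes of 
this sketch, not an existing API; in datafusion-python the equivalent loop would 
live in Rust.

```python
class CodecError(Exception):
    """Raised when no codec, including the default, can decode the bytes."""


def decode_with_codecs(data: bytes, codecs, default_codec):
    """Try each user-supplied codec in order, then the default codec.

    Each codec is assumed to expose a `try_decode(data)` method that raises
    on failure; the first successful result wins.
    """
    for codec in (*codecs, default_codec):
        try:
            return codec.try_decode(data)
        except Exception:
            continue  # this codec could not handle the payload; try the next
    raise CodecError("no codec could decode the serialized plan")
```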
   
   
   If this all makes sense to you, then it seems like the path forward, bits and 
pieces of which I have implemented in a branch on my computer, is:
   
   1. Add `FFI_LogicalExtensionCodec`, which is probably fine to limit to table 
providers for now. I'd almost immediately want to extend it to udf, udaf, and 
udwf.
   2. Update all of the FFI structs to include an optional magic number, with a 
description of how to use it.
   3. Update the `to_proto` and `from_proto` methods as described above to 
iterate through provided codecs.
   4. Create an example implementation in the datafusion repo that demonstrates 
full end-to-end creation of a plan, storing it to bytes, decoding it in a 
different session context, and executing the plan.
   
   What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

