timsaucer commented on issue #1181: URL: https://github.com/apache/datafusion-python/issues/1181#issuecomment-3065925816
I've been spending some time today looking at this. I'm try to capture the exact use case and how it might work for the implementation. Suppose we want to create a library `IcebergDF`. This library will be used in both Rust and Python implementations. Within this library we have `IcebergTableProvider` which is a struct that implements `TableProvider`. We also have `IcebergLogicalCodec` which implements `LogicalExtensionCodec`. For the codec, the only two methods it supports are `try_encode_table_provider` and `try_decode_table_provider`. We have created a logical plan in `datafusion-python` that includes a table from an `IcebergTableProvider` instance. This was something along the lines of. ```python my_iceberg_table = IcebergTable(somevars) ctx.register_table("my_table", my_iceberg_table) df = ctx.table("my_table") plan = df.logical_plan() ``` In this snippet we want to call `plan.to_proto` to get a serialized plan bytes that we can send along to some remote parts of our larger system. What we want is the `to_proto` method to have an updated signature to be like ```python def to_proto(self, codecs: *LogicalExtensionCodec) -> bytes: ``` In this case we would pass `plan.to_proto(my_codec)` where `my_codec` is an instance of `IcebergLogicalCodec`. `IcebergLogicalCodec` is a python object that has a function `__datafusion_logical_extension_codec__` that will give the PyCapsule `FFI_LogicalExtensionCodec`. Now we run into a our first problem: Even though we are passing an Iceberg table provider created *within the same library* as this extension codec, it cannot tell. We have passed the FFI table provider object back and forth from the Iceberg library to df-python and back. I have thought of two approaches, but open to others: - My first thought was to add `try_encode` to `FFI_TableProvider` and also `ForeignTableProvider`. Then the `IcebergLogicalCodec` can do a downcast to `ForeignTableProvider` and call through the methods to `try_encode`. 
  This would also require `IcebergTableProvider` to implement a trait letting us know it can do the encoding. This seems *very* round-about and easy to implement incorrectly.
- My second thought, prompted by an LLM suggestion, was to add an optional magic number to `FFI_TableProvider`. Then `IcebergLogicalCodec` can check whether that magic number matches, which would indicate that we have been passed back a pointer from the same providing library. *If* the magic number matches, then we know it is safe for `IcebergLogicalCodec` to access the inner `Arc<dyn TableProvider>` directly.

Supposing one of those two approaches works, we should then be at the point where we can encode our table provider for later processing.

Now for the second half: we want to decode this plan in either another Python workflow or a Rust workflow. Focusing on the Python side, since I *think* the Rust side would be even more straightforward, we want to make a corresponding update to our `from_proto` signature:

```python
@staticmethod
def from_proto(ctx: SessionContext, data: bytes, *codecs: LogicalExtensionCodec) -> LogicalPlan:
```

In this case `IcebergLogicalCodec` already knows about `IcebergTableProvider`, since it is specifically designed to encode/decode these table providers. In our `datafusion-python` implementation of `from_proto` on the Rust side, we access the codec(s) via FFI. Then we iterate through them until one succeeds. If none succeed, we try the default. Should they all fail, we return an error.

If this all makes sense to you, then the path forward (bits and pieces of which I have implemented in a branch on my computer) seems to be:

1. Add `FFI_LogicalExtensionCodec`, which is probably fine to limit to table providers for now. I'd almost immediately want to extend it to UDFs, UDAFs, and UDWFs.
2. Update all of the FFI structs to include an optional magic number, with a description of how to use it.
3. Update the `to_proto` and `from_proto` methods as described above to iterate through the provided codecs.
4. Create an example implementation in the datafusion repo that demonstrates full end-to-end creation of a plan, storing it to bytes, decoding it in a different session context, and executing the plan.

What do you think?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
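To make the proposed codec dispatch concrete, here is a minimal, dependency-free Python sketch of the encode/decode flow described above: tag payloads with a magic number so the codec can recognize its own library's providers, try each user-supplied codec in order, fall back to the default, and error out only if every codec fails. All class and method names here (`IcebergLogicalCodec`, `decode_table_provider`, the `MAGIC` prefix, and so on) are illustrative stand-ins, not the real `datafusion-python` or FFI API:

```python
class CodecError(Exception):
    """Raised when a codec cannot handle a given provider or payload."""


class IcebergTableProvider:
    """Stand-in for the library's TableProvider implementation."""

    def __init__(self, location: str):
        self.location = location


class IcebergLogicalCodec:
    # Plays the role of the optional magic number on the FFI struct: it
    # lets the codec recognize payloads produced by its own library.
    MAGIC = b"ICEBERG1"

    def try_encode_table_provider(self, provider) -> bytes:
        if not isinstance(provider, IcebergTableProvider):
            raise CodecError("not an Iceberg table provider")
        return self.MAGIC + provider.location.encode()

    def try_decode_table_provider(self, data: bytes) -> IcebergTableProvider:
        if not data.startswith(self.MAGIC):
            raise CodecError("magic number mismatch")
        return IcebergTableProvider(data[len(self.MAGIC):].decode())


class DefaultCodec:
    """The built-in codec, which knows nothing about extension providers."""

    def try_decode_table_provider(self, data: bytes):
        raise CodecError("default codec cannot decode extension table providers")


def decode_table_provider(data: bytes, codecs):
    """Try each user codec in order, then the default; error if all fail."""
    for codec in (*codecs, DefaultCodec()):
        try:
            return codec.try_decode_table_provider(data)
        except CodecError:
            continue
    raise CodecError("no registered codec could decode the table provider")
```

So a round trip would look like `decode_table_provider(codec.try_encode_table_provider(provider), [codec])`, while a payload no codec recognizes surfaces a single aggregate error rather than silently falling through. The real implementation would of course do this dispatch on the Rust side of `from_proto`, reaching each codec through its `__datafusion_logical_extension_codec__` PyCapsule.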