Hi, I'm working on a Python UDF PoC and would like to get the community's feedback on its design.
The goal of this PoC is to enable a user to integrate Python UDFs into an Ibis/Substrait/Arrow workflow. The basic idea is that the user creates an Ibis expression that includes Python UDFs they implemented, then uses a single invocation to:

1. Produce a Substrait plan for the Ibis expression.
2. Deliver the Substrait plan to Arrow.
3. Consume the Substrait plan in Arrow to obtain an execution plan.
4. Run the execution plan.
5. Deliver back the resulting table outputs of the plan.

I already have the above steps implemented, with support for some existing execution nodes and compute functions in Arrow, but without support for Python UDFs, which is the focus of this discussion. To add this support, I am considering a design where:

* Ibis is extended to support an expression of type Python UDF, which carries the serialized UDF code in text form (likely using cloudpickle).
* Ibis-Substrait is extended to support an extension function of type Python UDF, which also carries the serialized UDF code (as text within protobuf).
* PyArrow is extended to process a Substrait plan that includes serialized Python UDFs using these steps:
  * Deserialize the Python UDFs.
  * Register the Python UDFs (preferably in a local scope that is cleared after processing ends).
  * Run the execution plan corresponding to the Substrait plan.
  * Return the resulting table outputs of the plan.

Note that this design requires the user to drive PyArrow. It does not support driving Arrow C++ to process such a Substrait plan, because it does not start an embedded Python interpreter.

Cheers,
Yaron.
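P.S. To make the serialized-code transport concrete, here is a minimal sketch of the round-trip the design implies: serialize a UDF to a text form suitable for embedding in a protobuf string field, then deserialize and call it on the consumer side. I use stdlib pickle plus base64 here so the snippet is self-contained; the actual design would use cloudpickle, which serializes the function body by value so the code itself travels with the plan (plain pickle only serializes a reference to a module-level function).

```python
import base64
import pickle  # the real design would use cloudpickle for by-value serialization


def my_udf(x):
    """An example user-defined function."""
    return x * 2


# Producer side (Ibis / Ibis-Substrait): serialize the UDF to text form,
# suitable for carrying inside a protobuf string field.
udf_text = base64.b64encode(pickle.dumps(my_udf)).decode("ascii")

# Consumer side (PyArrow): recover the callable from the text form.
recovered = pickle.loads(base64.b64decode(udf_text))

print(recovered(21))  # → 42
```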
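P.S. On the PyArrow side, the "register the Python UDFs" step could build on the existing `pyarrow.compute.register_scalar_function` API (available since PyArrow 9.0). A minimal sketch of registering and invoking a UDF, with hypothetical names; note that this API registers into the global function registry, whereas the design above would prefer a local scope that can be cleared after the plan runs:

```python
import pyarrow as pa
import pyarrow.compute as pc


def double(ctx, x):
    # A scalar UDF: first argument is the function context, then the input arrays.
    return pc.multiply(x, pa.scalar(2, type=pa.int64()))


# Register the (in a real run, deserialized) UDF with Arrow's compute registry.
# Note: this registry is global; the PoC would prefer a scoped registration.
pc.register_scalar_function(
    double,
    "my_udf_double",  # hypothetical name the Substrait plan would reference
    {"summary": "double", "description": "Doubles each element."},
    {"x": pa.int64()},  # input name -> type
    pa.int64(),         # output type
)

result = pc.call_function("my_udf_double", [pa.array([1, 2, 3], type=pa.int64())])
print(result.to_pylist())  # → [2, 4, 6]
```

In the full flow, the Substrait consumer would need to resolve the plan's Python UDF extension function to this registered name; that mapping is part of the design to be worked out.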