Thanks Weston.

> Good questions. I haven't given this a whole lot of thought yet so
> take these suggestions with a grain of salt. Probably the minimal
> code change approach would be to implement [2] and call the unregister
> from the ExecPlan's destructor. However, this is going to eventually
> lead to naming conflicts if multiple UDFs are running in parallel. We
> could generate unique names to work around this. We could also
> consider a second function registry, unique to the plan, combined with
> the master function registry via some kind of facade registry that
> prefers local UDFs and falls back to process UDFs. I think that
> might be the best long term solution.

If I understand this correctly, with [1] we can potentially deserialize
the UDF in the Python process that calls the Python consumer API (likely
the user's Python process), call the PyArrow API to register the UDF
(perhaps under a temporary UUID-based name to avoid conflicts), and then
construct a UDF ExecNode referencing that name? I think this would work,
but I'm not sure if I missed anything.

I think a second registry would help and be cleaner, but for the current
MVP purpose the temporary name + unregister-from-destructor approach
probably works.
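A rough sketch of that minimal flow on the consumer side, assuming the
scalar UDF prototype's pyarrow.compute.register_scalar_function (the
exact signature may differ across versions; the final unregister step
is hypothetical until [2] lands):

    import base64
    import uuid

    import cloudpickle
    import pyarrow as pa
    import pyarrow.compute as pc

    def register_embedded_udf(payload, in_types, out_type):
        # Recover the callable that was cloudpickled into the plan.
        func = cloudpickle.loads(base64.b64decode(payload))

        def wrapper(ctx, *args):
            # The prototype passes a UDF context as the first argument.
            return func(*args)

        # A UUID-based name avoids collisions between concurrent plans.
        name = "py_udf_" + uuid.uuid4().hex
        doc = {"summary": "plan-local python udf", "description": ""}
        pc.register_scalar_function(wrapper, name, doc, in_types, out_type)
        return name

    # name = register_embedded_udf(payload, {"x": pa.int64()}, pa.int64())
    # ... build the UDF ExecNode referencing `name`; when the ExecPlan's
    # destructor runs, unregister `name` (needs the hook proposed in [2]).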
On Mon, May 16, 2022 at 9:53 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> 1. This is probably best discussed on the Ibis mailing list (or
> Github? I'm not sure how the Ibis community communicates).
>
> 2. If we consider "encoding the embedded function in Substrait" to be
> part of (2) then this can be discussed in the Substrait community.
> That being said, in the sync call last week we had a brief discussion
> and agreed to start with the EmbeddedFunction message, add an "any"
> version, and then propose a PR to standardize it. We will need to find
> some spot to encode the embedded function into the plan itself
> however.
>
> 3. A PR is in progress (pretty close to merging) to expose the
> Substrait consumer to Python. I think this will be good enough for
> MVP purposes. Check out [1].
>
> > we are not sure how to proceed with (3), so wondering how to
> > register the deserialized UDF from Substrait and execute it
> > using the existing UDF code. Any thoughts?
>
> Good questions. I haven't given this a whole lot of thought yet so
> take these suggestions with a grain of salt. Probably the minimal
> code change approach would be to implement [2] and call the unregister
> from the ExecPlan's destructor. However, this is going to eventually
> lead to naming conflicts if multiple UDFs are running in parallel. We
> could generate unique names to work around this. We could also
> consider a second function registry, unique to the plan, combined with
> the master function registry via some kind of facade registry that
> prefers local UDFs and falls back to process UDFs. I think that
> might be the best long term solution.
>
> [1] https://github.com/apache/arrow/pull/12672
> [2] https://issues.apache.org/jira/browse/ARROW-16211
>
> On Mon, May 16, 2022 at 5:06 AM Li Jin <ice.xell...@gmail.com> wrote:
> >
> > For context, this is a continuation of a previous email discussion:
> > "RPC: Out of Process Python UDFs in Arrow Compute" where we identified
> > the following reasonable steps:
> >
> > (1) A way to serialize the user function (and function parameters)
> > (in the doc I proposed a cloudpickle-based approach similar to how
> > Spark does it)
> > (2) Add a Substrait relation/expression that captures the UDF
> > (3) Deserialize the Substrait relation/expression in Arrow compute
> > and execute the UDF (either using the approach in the current Scalar
> > UDF prototype or do something else)
> > (Same as what Yaron laid out above.)
> >
> > Now I think we have reasonable solutions for (1) and (2) (at least
> > for PoC purposes), but we are not sure how to proceed with (3), so
> > wondering how to register the deserialized UDF from Substrait and
> > execute it using the existing UDF code. Any thoughts?
> >
> > Li
> >
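As an aside on step (1) above, a minimal sketch of the cloudpickle
round trip (base64 is just one illustrative way to get a text form
that fits in a protobuf string field):

    import base64

    import cloudpickle

    def times_two(x):
        return x * 2

    # Producer side (Ibis / Ibis-Substrait): pickle the function and
    # encode it as text so it can ride along inside the Substrait plan.
    payload = base64.b64encode(cloudpickle.dumps(times_two)).decode("ascii")

    # Consumer side (Arrow): decode and unpickle to recover a callable.
    func = cloudpickle.loads(base64.b64decode(payload))
    assert func(21) == 42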
> > On Sun, May 15, 2022 at 8:02 AM Yaron Gvili <rt...@hotmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I'm working on a Python UDFs PoC and would like to get the
> > > community's feedback on its design.
> > >
> > > The goal of this PoC is to enable a user to integrate Python UDFs
> > > in an Ibis/Substrait/Arrow workflow. The basic idea is that the
> > > user would create an Ibis expression that includes Python UDFs
> > > implemented by the user, then use a single invocation to:
> > >
> > >   1. produce a Substrait plan for the Ibis expression;
> > >   2. deliver the Substrait plan to Arrow;
> > >   3. consume the Substrait plan in Arrow to obtain an execution plan;
> > >   4. run the execution plan;
> > >   5. deliver back the resulting table outputs of the plan.
> > >
> > > I already have the above steps implemented, with support for some
> > > existing execution nodes and compute functions in Arrow, but
> > > without the support for Python UDFs, which is the focus of this
> > > discussion. To add this support I am thinking about a design where:
> > >
> > >   * Ibis is extended to support an expression of type Python UDF,
> > >     which carries (text-form) serialized UDF code (likely using
> > >     cloudpickle)
> > >   * Ibis-Substrait is extended to support an extension function of
> > >     type Python UDF, which also carries (text-within-protobuf-form)
> > >     serialized UDF code
> > >   * PyArrow is extended to process a Substrait plan that includes
> > >     serialized Python UDFs using these steps:
> > >       * Deserialize the Python UDFs
> > >       * Register the Python UDFs (preferably, in a local scope that
> > >         is cleared after processing ends)
> > >       * Run the execution plan corresponding to the Substrait plan
> > >       * Return the resulting table outputs of the plan
> > >
> > > Note that this design requires the user to drive PyArrow. It does
> > > not support driving Arrow C++ to process such a Substrait plan (it
> > > does not start an embedded Python interpreter).
> > >
> > > Cheers,
> > > Yaron.
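To illustrate the "local scope that is cleared after processing ends"
bullet above (which is essentially the facade registry Weston suggested
earlier in the thread), here is a toy Python version of the concept.
Arrow's real FunctionRegistry lives in C++; this only sketches the
lookup/shadowing behavior:

    class FacadeRegistry:
        """Prefers plan-local UDFs, falls back to the process registry."""

        def __init__(self, process_registry):
            self._process = process_registry  # shared, long-lived
            self._local = {}                  # plan-scoped, short-lived

        def register(self, name, func):
            # Plan-local registration; never touches the process registry.
            self._local[name] = func

        def lookup(self, name):
            # Local UDFs shadow process-wide functions of the same name.
            if name in self._local:
                return self._local[name]
            return self._process.lookup(name)

        def clear_local(self):
            # Called when the plan finishes (e.g. ExecPlan destruction).
            self._local.clear()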