Weston - Yes, I have seen the pull request above (very cool). However, if I understand correctly, the UDFs implemented in that PR are still "compositions of existing C++ kernels" rather than "arbitrary pandas/numpy code", so it solves a somewhat different problem IMO.
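To illustrate what I mean by "arbitrary pandas/numpy code", here is a rough sketch of a UDF that is not naturally a composition of existing kernels. This is illustrative only; the stdlib NormalDist stands in for scipy.stats.norm.ppf so the snippet is self-contained:

```python
# Illustrative sketch only: a "normal distribution percentile rank" UDF
# written as arbitrary pandas code. In practice one would reach for
# scipy.stats.norm.ppf; the stdlib NormalDist is a stand-in here.
from statistics import NormalDist

import pandas as pd


def normal_percentile_rank(s: pd.Series) -> pd.Series:
    # Percentile rank, rescaled into the open interval (0, 1) so the
    # inverse CDF is defined at every point...
    n = len(s)
    pct = s.rank(pct=True) * n / (n + 1)
    # ...then pushed through the inverse normal CDF (the "ppf").
    return pct.map(NormalDist().inv_cdf)
```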
For example, if I want to compute a "normal distribution percentile rank", I cannot quite use the scalar UDFs above to do that (we would need to add a kernel similar to scipy.stats.norm.ppf).

"1. Why do you want to run these UDFs in a separate process? Is this for robustness (if the UDF crashes the process recovery is easier) or performance (potentially working around the GIL if you have multiple python processes)? Have you given much thought to how the data would travel back and forth between the processes? For example, via sockets (maybe flight) or shared memory?"

The main reason I want to run UDFs in a separate process is that this is the only way I can think of to run arbitrary pandas/numpy code (assuming these are stateless pure functions). It also has side benefits: a buggy UDF won't crash the compute engine. The trade-offs are lower performance and having to manage an additional Python worker process. I don't have much thought on the data communication channel yet, but I think something like a local socket should work reasonably well, with a bonus if we can reduce copies or use shared memory. From my experience the overhead is acceptable, considering these UDFs often involve more complicated math or relational computations (otherwise they wouldn't need to be UDFs at all).

"2. The current implementation doesn't address serialization of UDFs."

Sorry, I am not sure what the question is here - do you mean the proposal doesn't address serialization of UDFs, and you want to ask about that?

"I'm not sure a separate ExecNode would be necessary. So far we've implemented UDFs at the kernel function level and I think we can continue to do that, even if we are calling out of process workers."

I am not sure either. This could perhaps be a kernel instead of a node. The reason I made it a node for now is that I think it would involve lifecycle management for the Python workers, so it doesn't feel like a "kernel" to me. But I don't feel strongly about that.
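To sketch the kind of worker setup I have in mind - a toy illustration, not the proposed design - the snippet below runs a stateless pure-function UDF in a child process. It uses a multiprocessing pipe where a real implementation might use a local socket or shared memory, and it forces the POSIX "fork" start method purely to keep the sketch simple:

```python
# Toy sketch (not the proposed design): run a stateless pure-function UDF
# in a separate Python worker process, so a buggy UDF cannot crash the
# calling engine. Data is pickled over a pipe here; a real implementation
# might use a local socket (e.g. Flight) or shared memory to reduce copies.
import multiprocessing as mp


def _worker(conn):
    values = conn.recv()                 # receive input batch from the engine
    conn.send([v * 2 for v in values])   # stand-in for arbitrary pandas/numpy code
    conn.close()


def run_udf_out_of_process(values):
    # "fork" keeps the sketch simple on POSIX; a real engine would manage
    # worker lifecycle (start / reuse / shutdown) explicitly, which is why
    # this felt more like a node than a kernel to me.
    ctx = mp.get_context("fork")
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=_worker, args=(child_conn,))
    proc.start()
    parent_conn.send(values)
    result = parent_conn.recv()
    proc.join()
    return result
```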
On Wed, May 4, 2022 at 4:06 PM Weston Pace <weston.p...@gmail.com> wrote:
> Hi Li, have you seen the python UDF prototype that we just recently
> merged into the execution engine at [1]? It adds support for scalar
> UDFs.
>
> Comparing your proposal to what we've done so far I would ask:
>
> 1. Why do you want to run these UDFs in a separate process? Is this
> for robustness (if the UDF crashes the process recovery is easier) or
> performance (potentially working around the GIL if you have multiple
> python processes)? Have you given much thought to how the data would
> travel back and forth between the processes? For example, via sockets
> (maybe flight) or shared memory?
> 2. The current implementation doesn't address serialization of UDFs.
>
> I'm not sure a separate ExecNode would be necessary. So far we've
> implemented UDFs at the kernel function level and I think we can
> continue to do that, even if we are calling out of process workers.
>
> [1] https://github.com/apache/arrow/pull/12590
>
> On Wed, May 4, 2022 at 6:12 AM Li Jin <ice.xell...@gmail.com> wrote:
> >
> > Hello,
> >
> > I have a somewhat controversial idea to introduce a "bridge" solution for
> > Python UDFs in Arrow Compute and have write up my thoughts in this
> > proposal:
> >
> > https://docs.google.com/document/d/1s7Gchq_LoNuiZO5bHq9PZx9RdoCWSavuS58KrTYXVMU/edit?usp=sharing
> >
> > I am curious to hear what the community thinks about this. (I am ready to
> > take criticism :) )
> >
> > I wrote this in just 1-2 hours so I'm happy to explain anything that is
> > unclear.
> >
> > Appreciate your feedback,
> > Li