Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-11 Thread Li Jin
s quite useful. RE: "At the moment, I'm not aware of much is needed in the execution engine beyond what we have. If you (or

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-10 Thread Vibhatha Abeykoon
```
def add_one(s: pd.Series) -> pd.Series:
    return s + 1

t = t.mutate(new_col=add_one(t[col]))
```
Now I am not sure exactly what additional work needs to be done in

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-09 Thread Li Jin
compute"?
On Fri, May 6, 2022 at 4:28 AM Yaron Gvili wrote:
> The general design seems reasonable to me. However, I think the multithreading issue

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-06 Thread Vibhatha Abeykoon
CPU and GPU compute resources?
2. How to schedule compute threads as a set/unit for a locally running UDF that requires them?
3. How to express the number (and CPU/GPU kind) of local compute threads that a UDF re

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-06 Thread Weston Pace
unning in local processes external to Arrow?
Since this is quite a complex topic, instead of trying to support multiple UDF threading facilities within Arrow, I think it might be easier, at least as a first approach, to expose in Arrow

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-06 Thread Li Jin
a threading-model control API (I'm not aware of such an API, if one exists). An Arrow-with-UDFs user, who best knows the compute needs of the UDFs in the execution plan, would then be able to allocate compute resources to Arrow through this API and separately

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-06 Thread Li Jin
[2] https://www.openmp.org/wp-content/uploads/SC19-Iwasaki-Threads.pdf
Yaron.
____________
From: Weston Pace
Sent: Thursday, May 5, 2022 4:30 PM
To: dev@arrow.apache.org
Subject: Re: RFC: Out of Process Python UDFs in Arr

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-06 Thread Yaron Gvili
Pace
Sent: Thursday, May 5, 2022 4:30 PM
To: dev@arrow.apache.org
Subject: Re: RFC: Out of Process Python UDFs in Arrow Compute
Yes, I think you have the right understanding. That is what I was originally trying to say by pointing out that our current solution does not solve the serialization p

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-05 Thread Weston Pace
Yes, I think you have the right understanding. That is what I was originally trying to say by pointing out that our current solution does not solve the serialization problem. I think there is probably room for multiple approaches to this problem. Each will have its own tradeoffs. For example,

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-05 Thread Li Jin
After reading the above PR, I think I understand the approach a bit more now. If I understand this correctly, the above UDF functionality is similar to what I have in mind. The main difference seems to be where and how the UDF is executed: (1) In the PR above, the UDF is passed to the Compute e
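A minimal sketch of the "where and how the UDF is executed" distinction, using only the Python standard library. It is not the Arrow API and not the approach from the PR or the RFC: the function names, the worker program, and the JSON wire format are all hypothetical stand-ins chosen for illustration. It only shows that the out-of-process path has to serialize inputs and outputs across a process boundary, while the in-process path does not.

```python
import json
import subprocess
import sys

# Hypothetical worker program (an assumption, not an Arrow facility): it reads
# a JSON list from stdin, applies the UDF, and writes the JSON result to stdout.
WORKER_SOURCE = """
import json, sys

def add_one(values):
    return [v + 1 for v in values]

json.dump(add_one(json.load(sys.stdin)), sys.stdout)
"""


def run_udf_in_process(values):
    # In-process: no serialization, but the UDF shares the caller's interpreter.
    return [v + 1 for v in values]


def run_udf_out_of_process(values):
    # Out-of-process: inputs and outputs cross the process boundary as text,
    # so both directions pay a serialization cost.
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=json.dumps(values),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)
```

Both functions return the same values for the same input; what differs is the isolation (a crash in the worker does not take down the caller) and the cost of moving data between processes. A real design would presumably use a columnar format rather than JSON, but that choice is outside this sketch.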

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-05 Thread Li Jin
"
def pandas_rank(ctx, arr):
    series = arr.to_pandas()
    rank = series.rank()
    return pa.array(rank)
"
Oh nice! I didn't get that from the original PR, and it does look like this is closer to the problem I am trying to solve. At this point I will try to understand more about that PR and see if what I propo

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-04 Thread Weston Pace
> However, if I understand correctly, the UDF implemented in the PR above are still "composition of existing C++ kernels" instead of "arbitrary pandas/numpy" code, so it kind of resolves a different problem IMO.
That is not a correct understanding, though it is an understandable one as we

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-04 Thread Li Jin
Weston - Yes, I have seen the pull request above (very cool). However, if I understand correctly, the UDFs implemented in the PR above are still a "composition of existing C++ kernels" rather than "arbitrary pandas/numpy" code, so they solve a different problem IMO. For example, if I want to c

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-04 Thread Weston Pace
Hi Li, have you seen the Python UDF prototype that we just recently merged into the execution engine at [1]? It adds support for scalar UDFs. Comparing your proposal to what we've done so far, I would ask: 1. Why do you want to run these UDFs in a separate process? Is this for robustness (if th