> s quite useful.
> >
> > RE:
> > "At the moment, I'm not aware of much that is needed in the execution
> > engine beyond what we have. If you (or
> ```
> def add_one(s: pd.Series) -> pd.Series:
>     return s + 1
>
> t = t.mutate(new_col=add_one(t[col]))
> ```
>
> Now I am not sure exactly what additional work needs to be done in
> "compute"?
>
On Fri, May 6, 2022 at 4:28 AM Yaron Gvili wrote:

> The general design seems reasonable to me. However, I think the
> multithreading issue
> CPU and GPU compute resources?
> 2. How to schedule compute threads as a set/unit for a locally
> running UDF that requires them?
> 3. How to express the number (and CPU/GPU kind) of local compute
> threads that a UDF re
> unning in local processes external to Arrow?
>
> Since this is quite a complex topic, instead of trying to support
> multiple UDF threading facilities within Arrow, I think it might be easier,
> at least as a first approach, to expose in Arrow a threading-model control
> API (I'm not aware of such an API, if one exists). An Arrow-with-UDFs user,
> who best knows the compute needs of the UDFs in the execution plan, would
> then be able to allocate compute resources to Arrow through this API and
> separately
> [2] https://www.openmp.org/wp-content/uploads/SC19-Iwasaki-Threads.pdf
>
>
> Yaron.
>
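A rough illustration of the kind of threading control Yaron describes, using knobs pyarrow already exposes rather than the proposed control API: a user who knows the UDFs' compute needs can cap Arrow's own thread pools and keep the remaining cores for UDF workers. The core counts below are made-up example values.

```
# Sketch only: capping Arrow's own thread pools so that cores can be
# reserved for UDF worker threads/processes managed outside Arrow.
# The specific core counts here are arbitrary example values.
import pyarrow as pa

TOTAL_CORES = 8        # assumed machine budget for this example
UDF_WORKER_CORES = 4   # cores we want to leave for external UDF workers

# Limit the CPU thread pool used by Arrow compute / the execution engine.
pa.set_cpu_count(TOTAL_CORES - UDF_WORKER_CORES)

# I/O threads are controlled separately.
pa.set_io_thread_count(2)

print("Arrow CPU threads:", pa.cpu_count())
print("Arrow I/O threads:", pa.io_thread_count())
```

This only budgets CPU and I/O threads globally; scheduling a UDF's threads as a set, or accounting for GPU resources, would still need the finer-grained control described above.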
____________
From: Weston Pace
Sent: Thursday, May 5, 2022 4:30 PM
To: dev@arrow.apache.org
Subject: Re: RFC: Out of Process Python UDFs in Arrow Compute
Yes, I think you have the right understanding. That is what I was
originally trying to say by pointing out that our current solution
does not solve the serialization problem.

I think there is probably room for multiple approaches to this
problem. Each will have their own tradeoffs. For example,
After reading the above PR, I think I understand the approach a bit more
now.

If I understand this correctly, the above UDF functionality is similar to
what I have in mind. The main difference seems to be "where and how are the
UDFs executed":

(1) In the PR above, the UDF is passed to the Compute e
"def pandas_rank(ctx, arr):
series = arr.to_pandas()
rank = series.rank()
return pa.array(rank)
"
Oh nice! I didn't get that from the original PR, and it does look like this
is closer to the problem I am trying to solve. At this point I will try to
understand more about that PR and see if what I propo
> However, if I
> understand correctly, the UDFs implemented in the PR above are still
> "composition of existing C++ kernels" instead of
> "arbitrary pandas/numpy" code, so it kind of resolves a
> different problem IMO.

That is not a correct understanding, though it is an understandable
one as we
Weston - Yes, I have seen the pull request above (very cool). However, if I
understand correctly, the UDFs implemented in the PR above are still
"composition of existing C++ kernels" instead of "arbitrary pandas/numpy"
code, so it kind of resolves a different problem IMO.

For example, if I want to c
Hi Li, have you seen the Python UDF prototype that we just recently
merged into the execution engine at [1]? It adds support for scalar
UDFs.

Comparing your proposal to what we've done so far, I would ask:

1. Why do you want to run these UDFs in a separate process? Is this
for robustness (if th
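To make the "separate process" question concrete, here is a hedged sketch, not the design proposed in this thread, of an arbitrary pandas UDF running in its own worker process, with Arrow IPC carrying a record batch across the process boundary; the column names and the rank computation are example choices only.

```
# Sketch only, not the proposal from this thread: run an arbitrary pandas
# UDF in a separate worker process, shipping data across the process
# boundary with Arrow IPC. Column names and the rank UDF are examples.
import io
import multiprocessing as mp

import pyarrow as pa


def serialize(batch: pa.RecordBatch) -> bytes:
    sink = io.BytesIO()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return sink.getvalue()


def deserialize(data: bytes) -> pa.RecordBatch:
    reader = pa.ipc.open_stream(data)
    return reader.read_next_batch()


def worker(conn) -> None:
    # The worker has its own interpreter, so a crash or GIL contention in
    # the UDF does not take down the engine process.
    batch = deserialize(conn.recv_bytes())
    series = batch.column(0).to_pandas()      # arbitrary pandas code here
    out = pa.RecordBatch.from_arrays([pa.array(series.rank())], names=["rank"])
    conn.send_bytes(serialize(out))
    conn.close()


if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=worker, args=(child_conn,))
    proc.start()

    batch = pa.RecordBatch.from_arrays([pa.array([3.0, 1.0, 2.0])], names=["x"])
    parent_conn.send_bytes(serialize(batch))
    print(deserialize(parent_conn.recv_bytes()).to_pandas())
    proc.join()
```

The price is the extra serialization and IPC hop discussed earlier in the thread, which is the main tradeoff against running the UDF in-process.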