Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-07-17 Thread Andrew Lamb
An update here is that one of the DataFusion contributors, @xinlifoobar, did a very neat prototype of using arrow-udf in DataFusion[1] and wrote up their findings[2]. The major findings are that it would be possible, though it would take some additional work (e.g. single values, making the function

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-07-07 Thread Andrew Lamb
Thank you for the summary Felipe. Your description and suggestion sound reasonable to me. In terms of federated querying across services, perhaps that is something that more naturally fits in with the substrait[1] project 🤔 Andrew [1]: https://substrait.io/ On Thu, Jul 4, 2024 at 3:35 PM Felip

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-07-04 Thread Felipe Oliveira Carvalho
Hi Andrew, During the Arrow Community Meeting I asked Xuanwo many questions trying to clarify my understanding of what they mean by "UDF". To me and you it seems to mean "user defined compute kernels", but in the context of these libraries it's *also that* plus the ability to call these functions

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-07-03 Thread Andrew Lamb
What does everyone think about renaming this library to something like `arrow-auto-vectorizer` or `arrow-functions` to emphasize its role with codegen of vectorized implementations? In discussing this proposal internally, it took a while to explain what the use case of the library is. From my unde

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-07-01 Thread Xuanwo
I have cross-posted the proposal to the DataFusion community to collect more feedback: https://github.com/apache/datafusion/discussions/11192 On Mon, Jul 1, 2024, at 19:31, Andrew Lamb wrote: > I have been thinking about this project more, and the more I think about it > the more I like it. > > For

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-07-01 Thread Andrew Lamb
I have been thinking about this project more, and the more I think about it the more I like it. As an example of the kind of leverage a library like this might bring, we might consider changing the implementation of Arrow UDF to re-use the underlying buffers when possible (e.g. via unary_mut[1]). Th
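
The buffer re-use idea mentioned above can be sketched in plain Rust. This is only an illustration under stated assumptions: it uses `std::sync::Arc<Vec<i32>>` as a stand-in for a shared Arrow buffer, and the helper name `map_reuse` is hypothetical. It is not the arrow-rs `unary_mut` implementation and not arrow-udf code.

```rust
use std::sync::Arc;

/// Hypothetical helper: apply `f` to every element of a reference-counted
/// buffer. If the caller holds the only reference, mutate the existing
/// allocation in place; otherwise fall back to allocating a new buffer.
pub fn map_reuse(buf: Arc<Vec<i32>>, f: impl Fn(i32) -> i32) -> Arc<Vec<i32>> {
    match Arc::try_unwrap(buf) {
        // Uniquely owned: overwrite the values without a new data allocation.
        Ok(mut owned) => {
            for v in owned.iter_mut() {
                *v = f(*v);
            }
            Arc::new(owned)
        }
        // Still shared elsewhere: leave the original untouched and copy.
        Err(shared) => Arc::new(shared.iter().map(|&v| f(v)).collect()),
    }
}
```

With a uniquely owned buffer the result keeps the same data pointer as the input; if another `Arc` clone is alive, the original values stay intact and a fresh buffer is returned.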

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Xuanwo
> That said, wherever it ends up, there should be the agreement of > individuals to accept maintenance of it. Since it's in rust, that would > generally fall to the arrow-rs contributors and/or the DataFusion > contributors IMO. > > It would be good for it to be part of the community, but only if i

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Matt Topol
> This UDF implementation doesn’t depend on DataFusion. It can work with any data in the arrow format. Given this I'm in agreement with Antoine that it would be weird for it to be maintained within the DataFusion repo as opposed to its own repo (as we've done in the past for things like nanoarrow

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Xuanwo
Hi, This UDF implementation doesn’t depend on DataFusion. It can work with any data in the arrow format. It has the potential to let users write ONE UDF that works across different query engines, as we have shown in Databend and RisingWave. So I personally think it should be part o

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Antoine Pitrou
Is this UDF implementation based on DataFusion? If so, it makes sense for it to be part of the DataFusion project. OTOH, if it can work with any data in the Arrow format, then it would sound weird to maintain it in the DataFusion repo IMHO. Regards Antoine. Le 28/06/2024 à 21:52, Andrew

Re: Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Andrew Lamb
To be clear, if the arrow community thinks this would be better organized / administered in the Apache DataFusion project (especially if it is aligned with Rust) I think it would be good to discuss donating there. On Fri, Jun 28, 2024 at 3:17 PM Andrew Lamb wrote: > I think there are two aspects:

Re: Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Andrew Lamb
I think there are two aspects: 1. The actual mechanics of implementing functions 2. The actual library of udf functions (e.g. sin, cos, nullif, etc) I agree 2 is not something that belongs naturally in the arrow project and is better aligned with query engines. However I think 1 is worth consideri

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Antoine Pitrou
I'll note that PyArrow also allows defining user-defined functions and they are vectorized (the function arguments can be PyArrow arrays or scalars, depending on the context in which a function is being executed): https://arrow.apache.org/docs/python/compute.html#user-defined-functions My vo

RE: Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Raphael Taylor-Davies
I wonder if the DataFusion project might be a more natural home for this functionality? UDFs are more of a query engine concept, whereas arrow-rs is more focused on purely physical execution? On 28 June 2024 19:41:39 BST, Runji Wang wrote: >Hi Felipe, > >Vectorization will be applied whenever p

RE: Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Runji Wang
Hi Felipe, Vectorization will be applied whenever possible. When all input and output types of a function are primitive (int16, int32, int64, float32, float64) and do not involve any Option or Result, the macro will automatically generate code based on unary
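
The snippet above is cut off in the archive, so the following is only a rough sketch of the general idea of lifting a scalar function over primitive values into a column-at-a-time loop. It is plain Rust with no dependencies; the `Column` type and the names `lift_unary` and `double` are hypothetical and are not the code the arrow-udf macro generates, nor the arrow-rs API.

```rust
/// Hypothetical stand-in for a primitive Arrow array: a dense values
/// vector plus a per-slot validity mask (true = non-null).
#[derive(Debug, Clone, PartialEq)]
pub struct Column<T> {
    pub values: Vec<T>,
    pub valid: Vec<bool>,
}

/// A scalar function as a user might write it: primitive in, primitive
/// out, no Option and no Result.
pub fn double(x: i32) -> i32 {
    x.wrapping_mul(2)
}

/// Lift a scalar function to a whole column. Because `f` cannot fail and
/// never sees a null, it is applied to every slot in one tight loop (which
/// the compiler is free to auto-vectorize) and the validity mask is copied
/// through unchanged.
pub fn lift_unary<T: Copy, U>(input: &Column<T>, f: impl Fn(T) -> U) -> Column<U> {
    Column {
        values: input.values.iter().map(|&v| f(v)).collect(),
        valid: input.valid.clone(),
    }
}
```

Calling `lift_unary(&col, double)` computes a value for null slots as well, but those slots remain marked invalid in the output. Once a function takes or returns `Option` or `Result`, each slot has to be checked individually, which is presumably why the primitive-only case is singled out in the message.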

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Julian Hyde
In some ways, the problem of a UDF framework is larger than Arrow. UDFs need to give the same results, and execute efficiently, regardless of the platform (e.g. Arrow), hosting language, and UDF language. At SIGMOD there was a paper from TU Berlin that addresses this problem: "Query Compilation

RE: Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Runji Wang
Hi All, I am the initiator of this project. Thanks Xuanwo for helping to promote it and start this discussion. Regarding the location of the code, I prefer to keep everything in the same repository rather than spreading it across various language binding libraries. The current implementations

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Felipe Oliveira Carvalho
On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb wrote: > > Hi Xuanwo, > > Sorry for the delay in responding. I think the ability to easily write > functions that "feel" like native functions in whatever language and be > able to generate arrow / vectorized versions of them is quite valuable. > This

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Andrew Lamb
Hi Xuanwo, Sorry for the delay in responding. I think the ability to easily write functions that "feel" like native functions in whatever language and be able to generate arrow / vectorized versions of them is quite valuable. This is my understanding of what this proposal is about. I left some a