I have cross-posted the proposal to datafusion community to collect more feedback:
https://github.com/apache/datafusion/discussions/11192 On Mon, Jul 1, 2024, at 19:31, Andrew Lamb wrote: > I have been thinking about this project more, and the more I think about it > the more I like it. > > For example of the kind of leverage a library like this might bring, we > might consider changing the implementation of Arrow UDF to re-use the > underlying buffers when possible (e.g. via unary_mut[1]). This would likely > provide an across the board efficiency improvement for no costs to > downstream crates. > > Andrew > > [1]: > https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.unary_mut > > On Sat, Jun 29, 2024 at 1:47 AM Xuanwo <xua...@apache.org> wrote: > >> > That said, wherever it ends up, there should be the agreement of >> > individuals to accept maintenance of it. Since it's in rust, that would >> > generally fall to the arrow-rs contributors and/or the DataFusion >> > contributors IMO. >> > >> > It would be good for it to be part of the community, but only if it's not >> > going to end up just bitrotting somewhere. >> >> Thanks Matt. This concern does make sense. >> >> Arrow UDF is extensively used within RisingWave and Databend. We, the >> initial >> committers from both RisingWave and Databend, are eager to take >> responsibility >> for maintaining these crates. >> >> Additionally, some of us are involved in other Apache Projects, so we >> understand >> how the Apache Way functions. We will focus on community growth to ensure >> this >> project remains active. >> >> On Sat, Jun 29, 2024, at 13:29, Matt Topol wrote: >> >> This UDF implementation doesn’t depend on DataFusion. It can work with >> > any data in the arrow format. >> > >> > Given this I'm in agreement with Antoine that it would be weird for it to >> > be maintained within the DataFusion repo as opposed to it's own repo (as >> > we've done in the past for things like nanoarrow and arrow-experiments). >> > >> > That said, wherever it ends up, there should be the agreement of >> > individuals to accept maintenance of it. Since it's in rust, that would >> > generally fall to the arrow-rs contributors and/or the DataFusion >> > contributors IMO. >> > >> > It would be good for it to be part of the community, but only if it's not >> > going to end up just bitrotting somewhere. >> > >> > --Matt >> > >> > On Fri, Jun 28, 2024, 8:49 PM Xuanwo <xua...@apache.org> wrote: >> > >> >> Hi, >> >> >> >> This UDF implementation doesn’t depend on DataFusion. It can work with >> any >> >> data in the arrow format. >> >> >> >> It has the potential power to make users write ONE UDF function that >> works >> >> for different query engines as we showed up in databend and risingwave. >> >> >> >> So I personally think it should be part of arrow community. >> >> >> >> On Sat, Jun 29, 2024, at 05:06, Antoine Pitrou wrote: >> >> > Is this UDF implementation based on DataFusion? If so, it makes sense >> >> > for it to be part of the DataFusion project. >> >> > >> >> > OTOH, if it can work with any data in the Arrow format, then it would >> >> > sound weird to maintain it in the DataFusion repo IMHO. >> >> > >> >> > Regards >> >> > >> >> > Antoine. >> >> > >> >> > >> >> > Le 28/06/2024 à 21:52, Andrew Lamb a écrit : >> >> >> To be clear, if the arrow community thinks this would be better >> >> organized / >> >> >> administered in the Apache DataFusion project (especially if it is >> >> aligned >> >> >> with Rust) I think it would be good to discuss donating there >> >> >> >> >> >> On Fri, Jun 28, 2024 at 3:17 PM Andrew Lamb <al...@influxdata.com> >> >> wrote: >> >> >> >> >> >>> I think there are two aspects: >> >> >>> 1. The actual mechanics of implementing functions >> >> >>> 2. The actual library of udf functions (e.g. sin, cos, nullif, etc) >> >> >>> >> >> >>> I agree 2 is not something that belongs naturally in the arrow >> project >> >> and >> >> >>> is better aligned with query engines >> >> >>> >> >> >>> However I think 1 is worth considering. >> >> >>> >> >> >>> As I understand it, the problem arrow_udf solves is avoiding some of >> >> the >> >> >>> boilerplate required to make vectorized udfs. So instead of >> writing a >> >> >>> special eval_gcd function like this >> >> >>> >> >> >>> ``` >> >> >>> fn gcd(l: i64, r: i64) -> i64 { >> >> >>> // do gcd calculation >> >> >>> } >> >> >>> >> >> >>> // implement vectorized version >> >> >>> fn eval_gcd(left: &ArrayRef, right: &ArrayRef) -> ArrayRef { >> >> >>> let left = left.as_primitive<Int64Type>(); >> >> >>> let right = right.as_primitive<Int64Type>(); >> >> >>> res = binary(left, right, |l, r| gcd(l, r)); >> >> >>> Arc::new(res) >> >> >>> } >> >> >>> ``` >> >> >>> >> >> >>> The user simply annotates the scalar function and have the library >> code >> >> >>> gen the array version >> >> >>> ``` >> >> >>> #[function("gcd(int64, int64) -> int64", output = "eval_gcd")] >> >> >>> fn gcd(l: i64, r: i64) -> i64 { >> >> >>> // do gcd calculation >> >> >>> } >> >> >>> ``` >> >> >>> >> >> >>> We have a lot of boilerplate / non idea macro stuff in DataFusion >> that >> >> I >> >> >>> think this would help a lot. >> >> >>> >> >> >>> Andrew >> >> >>> >> >> >>> >> >> >>> On Fri, Jun 28, 2024 at 3:08 PM Raphael Taylor-Davies >> >> >>> <r.taylordav...@googlemail.com.invalid> wrote: >> >> >>> >> >> >>>> I wonder if the DataFusion project might be a more natural home for >> >> this >> >> >>>> functionality? UDFs are more of a query engine concept, whereas >> >> arrow-rs is >> >> >>>> more focused on purely physical execution? >> >> >>>> >> >> >>>> On 28 June 2024 19:41:39 BST, Runji Wang <wangrunji0...@163.com> >> >> wrote: >> >> >>>>> Hi Felipe, >> >> >>>>> >> >> >>>>> Vectorization will be applied whenever possible. When all input >> and >> >> >>>> output types of a function are primitive (int16, int32, int64, >> >> float32, >> >> >>>> float64) and do not involve any Option or Result, the macro will >> >> >>>> automatically generate code based on unary < >> >> >>>> https://docs.rs/arrow/latest/arrow/compute/fn.unary.html> or >> binary < >> >> >>>> https://docs.rs/arrow/latest/arrow/compute/fn.binary.html> >> kernels, >> >> >>>> which potentially allows for vectorization. >> >> >>>>> >> >> >>>>> Both examples you showed are not vectorized. The `div` function is >> >> due >> >> >>>> to the Result output, while `gcd` is due to the loop in its >> >> implementation. >> >> >>>> However, if the function is simple enough, like an `add` function: >> >> >>>>> >> >> >>>>> #[function("add(int, int) -> int")] >> >> >>>>> fn add(a: i32, b: i32) -> i32 { >> >> >>>>> a + b >> >> >>>>> } >> >> >>>>> >> >> >>>>> It can be auto-vectorized by llvm. >> >> >>>>> >> >> >>>>> Runji >> >> >>>>> >> >> >>>>> >> >> >>>>> On 2024/06/28 17:13:16 Felipe Oliveira Carvalho wrote: >> >> >>>>>> On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb < >> al...@influxdata.com> >> >> >>>> wrote: >> >> >>>>>>> >> >> >>>>>>> Hi Xuanwo, >> >> >>>>>>> >> >> >>>>>>> Sorry for the delay in responding. I think the ability to >> easily >> >> >>>> write >> >> >>>>>>> functions that "feel" like native functions in whatever language >> >> and >> >> >>>> be >> >> >>>>>>> able to generate arrow / vectorized versions of them is quite >> >> >>>> valuable. >> >> >>>>>>> This is my understanding of what this proposal is about. >> >> >>>>>> >> >> >>>>>> My understanding is that it's not vectorized. From the examples >> in >> >> >>>>>> risingwavelabs/arrow-udf, < >> >> https://github.com/risingwavelabs/arrow-udf> >> >> >>>> it >> >> >>>>>> looks like the macros generate code that gathers values from >> columns >> >> >>>> into >> >> >>>>>> local scalars that are passed as scalar parameters to user >> >> functions. >> >> >>>> Is >> >> >>>>>> the hope here that rustc/llvm will auto-vectorize the code? >> >> >>>>>> >> >> >>>>>> #[function("gcd(int, int) -> int")] >> >> >>>>>> fn gcd(mut a: i32, mut b: i32) -> i32 { >> >> >>>>>> while b != 0 { >> >> >>>>>> (a, b) = (b, a % b); >> >> >>>>>> } >> >> >>>>>> a >> >> >>>>>> } >> >> >>>>>> >> >> >>>>>> #[function("div(int, int) -> int")] >> >> >>>>>> fn div(x: i32, y: i32) -> Result<i32, &'static str> { >> >> >>>>>> if y == 0 { >> >> >>>>>> return Err("division by zero"); >> >> >>>>>> } >> >> >>>>>> Ok(x / y) >> >> >>>>>> } >> >> >>>>>> >> >> >>>>>>> I left some additional comments on the markdown. >> >> >>>>>>> >> >> >>>>>>> One thing that might be worth doing is articulate some other >> >> >>>> potential >> >> >>>>>>> locations for where the code might go. One option, as I think >> you >> >> >>>> propose, >> >> >>>>>>> is to make its own repository. Another option could be to >> donate >> >> >>>> the code >> >> >>>>>>> and put the various language bindings in the same repo as the >> arrow >> >> >>>>>>> language implementations (e.g arrow-rs, arrow for python, etc) >> >> which >> >> >>>> would >> >> >>>>>>> likely make it easier to maintain and discover. >> >> >>>>>>> >> >> >>>>>>> I am curious about what other devs / users feel about this? >> >> >>>>>>> >> >> >>>>>>> Andrew >> >> >>>>>>> >> >> >>>>>>> >> >> >>>>>>> >> >> >>>>>>> On Thu, Jun 20, 2024 at 3:04 AM Xuanwo <xu...@apache.org> >> wrote: >> >> >>>>>>> >> >> >>>>>>>> Hello, everyone. >> >> >>>>>>>> >> >> >>>>>>>> I start this thread to disscuss the donation of a User-Defined >> >> >>>> Function >> >> >>>>>>>> Framework for Apache Arrow. >> >> >>>>>>>> >> >> >>>>>>>> Feel free to review and leave your comments here. For live >> review, >> >> >>>>>> please >> >> >>>>>>>> visit: >> >> >>>>>>>> >> >> >>>>>>>> https://hackmd.io/@xuanwo/apache-arrow-udf >> >> >>>>>>>> >> >> >>>>>>>> The original content also pasted here for a quick reading: >> >> >>>>>>>> >> >> >>>>>>>> ------ >> >> >>>>>>>> >> >> >>>>>>>> ## Abstract >> >> >>>>>>>> >> >> >>>>>>>> Arrow UDF is a User-Defined Function Framework for Apache >> Arrow. >> >> >>>>>>>> >> >> >>>>>>>> ## Proposal >> >> >>>>>>>> >> >> >>>>>>>> Arrow UDF allows user to easily create and run user-defined >> >> >>>> functions >> >> >>>>>>>> (UDF) in Rust, Python, Java or JavaScript based on Apache >> Arrow. >> >> >>>> The >> >> >>>>>>>> functions can be executed natively, or in WebAssembly, or in a >> >> >>>> remote >> >> >>>>>>>> server via Arrow Flight. >> >> >>>>>>>> >> >> >>>>>>>> Arrow UDF was originally designed to be used by the RisingWave >> >> >>>> project >> >> >>>>>> but >> >> >>>>>>>> is now being used by Databend and several database startups. >> >> >>>>>>>> >> >> >>>>>>>> We believe that the Arrow UDF project will provide diversity >> value >> >> >>>> to >> >> >>>>>> the >> >> >>>>>>>> entire Arrow community. >> >> >>>>>>>> >> >> >>>>>>>> ## Background >> >> >>>>>>>> >> >> >>>>>>>> Arrow UDF is being developed by an open-source community from >> day >> >> >>>> one >> >> >>>>>> and >> >> >>>>>>>> is owned by RisingWaveLabs. The project has been launched in >> >> >>>> December >> >> >>>>>> 2023. >> >> >>>>>>>> >> >> >>>>>>>> ## Initial Goals >> >> >>>>>>>> >> >> >>>>>>>> By transferring ownership of the project to the Apache Arrow, >> >> >>>> Arrow UDF >> >> >>>>>>>> expects to ensure its neutrality and further encourage and >> >> >>>> facilitate >> >> >>>>>> the >> >> >>>>>>>> adoption of Arrow UDF by the community. >> >> >>>>>>>> >> >> >>>>>>>> ## Current Status >> >> >>>>>>>> >> >> >>>>>>>> Contributors: 5 >> >> >>>>>>>> >> >> >>>>>>>> Users: >> >> >>>>>>>> >> >> >>>>>>>> - [RisingWave]: A Distributed SQL Database for Stream >> >> Processing. >> >> >>>>>>>> - [Databend]: An open-source cloud data warehouse that >> serves as >> >> >>>> a >> >> >>>>>>>> cost-effective alternative to Snowflake. >> >> >>>>>>>> >> >> >>>>>>>> ## Documentation >> >> >>>>>>>> >> >> >>>>>>>> The document of Arrow UDF is hosted at >> >> >>>>>>>> https://docs.rs/arrow-udf/latest/arrow_udf/. >> >> >>>>>>>> >> >> >>>>>>>> ## Initial Source >> >> >>>>>>>> >> >> >>>>>>>> The project currently holds a GitHub repository and multiple >> >> >>>> packages: >> >> >>>>>>>> >> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf >> >> >>>>>>>> >> >> >>>>>>>> Rust: >> >> >>>>>>>> >> >> >>>>>>>> - https://crates.io/arrow-udf/ >> >> >>>>>>>> - https://crates.io/arrow-udf-python/ >> >> >>>>>>>> - https://crates.io/arrow-udf-js/ >> >> >>>>>>>> - https://crates.io/arrow-udf-js-deno/ >> >> >>>>>>>> - https://crates.io/arrow-udf-wasm/ >> >> >>>>>>>> >> >> >>>>>>>> Python: >> >> >>>>>>>> >> >> >>>>>>>> - https://pypi.org/project/arrow-udf/ >> >> >>>>>>>> >> >> >>>>>>>> Those packge will retain its name, while the repository will be >> >> >>>> moved to >> >> >>>>>>>> apache org. >> >> >>>>>>>> >> >> >>>>>>>> ## Required Resources >> >> >>>>>>>> >> >> >>>>>>>> ### Mailing Lists >> >> >>>>>>>> >> >> >>>>>>>> We can reuse the existing mailing lists that arrow have. >> >> >>>>>>>> >> >> >>>>>>>> ### Git Repositories >> >> >>>>>>>> >> >> >>>>>>>> From >> >> >>>>>>>> >> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf >> >> >>>>>>>> >> >> >>>>>>>> To >> >> >>>>>>>> >> >> >>>>>>>> - https://gitbox.apache.org/asf/repos/arrow-udf >> >> >>>>>>>> - https://github.com/apache/arrow-udf >> >> >>>>>>>> >> >> >>>>>>>> ### Issue Tracking >> >> >>>>>>>> >> >> >>>>>>>> The project would like to continue using GitHub Issues. >> >> >>>>>>>> >> >> >>>>>>>> ### Other Resources >> >> >>>>>>>> >> >> >>>>>>>> The project has already chosen GitHub actions as continuous >> >> >>>> integration >> >> >>>>>>>> tools. >> >> >>>>>>>> >> >> >>>>>>>> ## Initial Committers >> >> >>>>>>>> >> >> >>>>>>>> - Runji Wang wangrunji0...@163.com >> >> >>>>>>>> - Giovanny Gutiérrez >> >> >>>>>>>> - sundy-li sund...@apache.org >> >> >>>>>>>> - Xuanwo xua...@apache.org >> >> >>>>>>>> - Max Justus Spransy maxjus...@gmail.com >> >> >>>>>>>> >> >> >>>>>>>> [RisingWave]: https://github.com/risingwavelabs/risingwave >> >> >>>>>>>> [Databend]: https://github.com/datafuselabs/databend >> >> >>>>>>>> >> >> >>>>>>>> Xuanwo >> >> >>>>>>>> >> >> >>>>>> >> >> >>> >> >> >>> >> >> >> >> >> >> >> -- >> >> Xuanwo >> >> >> >> -- >> Xuanwo >> -- Xuanwo https://xuanwo.io/