Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

Xuanwo Mon, 01 Jul 2024 05:10:48 -0700

I have cross-posted the proposal to the DataFusion community to collect more
feedback:


https://github.com/apache/datafusion/discussions/11192
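
On the buffer-reuse idea quoted below, here is a rough std-only sketch of
what "re-use the underlying buffers when possible" could mean. It is only a
model, not the arrow-rs `unary_mut` API: `unary_reuse` is a hypothetical
helper, and `Arc<Vec<i64>>` is an assumed stand-in for a reference-counted
Arrow value buffer. The values are rewritten in place when the buffer has a
single owner; otherwise a fresh buffer is allocated.

```rust
use std::sync::Arc;

/// Hypothetical helper: apply `f` to every value, reusing the allocation
/// when `buf` is uniquely owned and falling back to a copy when it is shared.
fn unary_reuse(mut buf: Arc<Vec<i64>>, f: impl Fn(i64) -> i64) -> Arc<Vec<i64>> {
    // `Arc::get_mut` returns `Some` only if no other `Arc` points to the data.
    if let Some(values) = Arc::get_mut(&mut buf) {
        for v in values.iter_mut() {
            *v = f(*v);
        }
        // Same allocation as the input, now holding the results.
        return buf;
    }
    // Shared buffer: leave the input untouched and allocate a new output.
    Arc::new(buf.iter().map(|&v| f(v)).collect())
}
```

Calling `unary_reuse(Arc::new(vec![1, 2, 3]), |v| v * 2)` takes the in-place
path. If another clone of the `Arc` is still alive, the call takes the copying
path and that clone keeps its original values.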

On Mon, Jul 1, 2024, at 19:31, Andrew Lamb wrote:
> I have been thinking about this project more, and the more I think about it
> the more I like it.
>
> As an example of the kind of leverage a library like this might bring, we
> might consider changing the implementation of Arrow UDF to re-use the
> underlying buffers when possible (e.g. via unary_mut[1]). This would likely
> provide an across-the-board efficiency improvement at no cost to
> downstream crates.
>
> Andrew
>
> [1]:
> https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.unary_mut
>
> On Sat, Jun 29, 2024 at 1:47 AM Xuanwo <[email protected]> wrote:
>
>> > That said, wherever it ends up, there should be the agreement of
>> > individuals to accept maintenance of it. Since it's in rust, that would
>> > generally fall to the arrow-rs contributors and/or the DataFusion
>> > contributors IMO.
>> >
>> > It would be good for it to be part of the community, but only if it's not
>> > going to end up just bitrotting somewhere.
>>
>> Thanks Matt. This concern does make sense.
>>
>> Arrow UDF is extensively used within RisingWave and Databend. We, the
>> initial
>> committers from both RisingWave and Databend, are eager to take
>> responsibility
>> for maintaining these crates.
>>
>> Additionally, some of us are involved in other Apache Projects, so we
>> understand
>> how the Apache Way functions. We will focus on community growth to ensure
>> this
>> project remains active.
>>
>> On Sat, Jun 29, 2024, at 13:29, Matt Topol wrote:
>> >> This UDF implementation doesn’t depend on DataFusion. It can work with
>> > any data in the arrow format.
>> >
>> > Given this I'm in agreement with Antoine that it would be weird for it to
>> > be maintained within the DataFusion repo as opposed to its own repo (as
>> > we've done in the past for things like nanoarrow and arrow-experiments).
>> >
>> > That said, wherever it ends up, there should be the agreement of
>> > individuals to accept maintenance of it. Since it's in rust, that would
>> > generally fall to the arrow-rs contributors and/or the DataFusion
>> > contributors IMO.
>> >
>> > It would be good for it to be part of the community, but only if it's not
>> > going to end up just bitrotting somewhere.
>> >
>> > --Matt
>> >
>> > On Fri, Jun 28, 2024, 8:49 PM Xuanwo <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> This UDF implementation doesn’t depend on DataFusion. It can work with
>> any
>> >> data in the arrow format.
>> >>
>> >> It has the potential to let users write ONE UDF that works across
>> >> different query engines, as we have shown in Databend and RisingWave.
>> >>
>> >> So I personally think it should be part of the Arrow community.
>> >>
>> >> On Sat, Jun 29, 2024, at 05:06, Antoine Pitrou wrote:
>> >> > Is this UDF implementation based on DataFusion? If so, it makes sense
>> >> > for it to be part of the DataFusion project.
>> >> >
>> >> > OTOH, if it can work with any data in the Arrow format, then it would
>> >> > sound weird to maintain it in the DataFusion repo IMHO.
>> >> >
>> >> > Regards
>> >> >
>> >> > Antoine.
>> >> >
>> >> >
>> >> > Le 28/06/2024 à 21:52, Andrew Lamb a écrit :
>> >> >> To be clear, if the arrow community thinks this would be better
>> >> organized /
>> >> >> administered in the Apache DataFusion project (especially if it is
>> >> aligned
>> >> >> with Rust) I think it would be good to discuss donating there
>> >> >>
>> >> >> On Fri, Jun 28, 2024 at 3:17 PM Andrew Lamb <[email protected]>
>> >> wrote:
>> >> >>
>> >> >>> I think there are two aspects:
>> >> >>> 1. The actual mechanics of implementing functions
>> >> >>> 2. The actual library of udf functions (e.g. sin, cos, nullif, etc)
>> >> >>>
>> >> >>> I agree 2 is not something that belongs naturally in the arrow
>> project
>> >> and
>> >> >>> is better aligned with query engines
>> >> >>>
>> >> >>> However I think 1 is worth considering.
>> >> >>>
>> >> >>> As I understand it, the problem arrow_udf solves is avoiding some of
>> >> the
>> >> >>> boilerplate  required to make vectorized udfs. So instead of
>> writing a
>> >> >>> special eval_gcd function like this
>> >> >>>
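>> >> >>> ```
>> >> >>> use std::sync::Arc;
>> >> >>> use arrow::array::{ArrayRef, AsArray, Int64Array};
>> >> >>> use arrow::compute::binary;
>> >> >>> use arrow::datatypes::Int64Type;
>> >> >>> use arrow::error::ArrowError;
>> >> >>>
>> >> >>> fn gcd(l: i64, r: i64) -> i64 {
>> >> >>>   // do gcd calculation
>> >> >>> }
>> >> >>>
>> >> >>> // implement vectorized version
>> >> >>> fn eval_gcd(left: &ArrayRef, right: &ArrayRef) -> Result<ArrayRef, ArrowError> {
>> >> >>>    // downcast the dynamically typed arrays to Int64 arrays
>> >> >>>    let left = left.as_primitive::<Int64Type>();
>> >> >>>    let right = right.as_primitive::<Int64Type>();
>> >> >>>    // apply the scalar function element-wise; errors if lengths differ
>> >> >>>    let res: Int64Array = binary(left, right, |l, r| gcd(l, r))?;
>> >> >>>    Ok(Arc::new(res))
>> >> >>> }
>> >> >>> ```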
>> >> >>>
>> >> >>> The user simply annotates the scalar function and has the library
>> >> >>> generate the array version:
>> >> >>> ```
>> >> >>> #[function("gcd(int64, int64) -> int64", output = "eval_gcd")]
>> >> >>> fn gcd(l: i64, r: i64) -> i64 {
>> >> >>>   // do gcd calculation
>> >> >>> }
>> >> >>> ```
>> >> >>>
>> >> >>> We have a lot of boilerplate / non-ideal macro stuff in DataFusion that I
>> >> >>> think this would help a lot with.
>> >> >>>
>> >> >>> Andrew
>> >> >>>
>> >> >>>
>> >> >>> On Fri, Jun 28, 2024 at 3:08 PM Raphael Taylor-Davies
>> >> >>> <[email protected]> wrote:
>> >> >>>
>> >> >>>> I wonder if the DataFusion project might be a more natural home for
>> >> this
>> >> >>>> functionality? UDFs are more of a query engine concept, whereas
>> >> arrow-rs is
>> >> >>>> more focused on purely physical execution?
>> >> >>>>
>> >> >>>> On 28 June 2024 19:41:39 BST, Runji Wang <[email protected]>
>> >> wrote:
>> >> >>>>> Hi Felipe,
>> >> >>>>>
>> >> >>>>> Vectorization will be applied whenever possible. When all input
>> and
>> >> >>>> output types of a function are primitive (int16, int32, int64,
>> >> float32,
>> >> >>>> float64) and do not involve any Option or Result, the macro will
>> >> >>>> automatically generate code based on unary <
>> >> >>>> https://docs.rs/arrow/latest/arrow/compute/fn.unary.html> or
>> binary <
>> >> >>>> https://docs.rs/arrow/latest/arrow/compute/fn.binary.html>
>> kernels,
>> >> >>>> which potentially allows for vectorization.
>> >> >>>>>
>> >> >>>>> Neither of the examples you showed is vectorized: `div` because of its
>> >> >>>>> Result output, and `gcd` because of the loop in its implementation.
>> >> >>>>> However, if the function is simple enough, like an `add` function:
>> >> >>>>>
>> >> >>>>> #[function("add(int, int) -> int")]
>> >> >>>>> fn add(a: i32, b: i32) -> i32 {
>> >> >>>>>     a + b
>> >> >>>>> }
>> >> >>>>>
>> >> >>>>> It can be auto-vectorized by llvm.
>> >> >>>>>
>> >> >>>>> Runji
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> On 2024/06/28 17:13:16 Felipe Oliveira Carvalho wrote:
>> >> >>>>>> On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb <
>> [email protected]>
>> >> >>>> wrote:
>> >> >>>>>>>
>> >> >>>>>>> Hi Xuanwo,
>> >> >>>>>>>
>> >> >>>>>>> Sorry for the delay in responding. I think  the ability to
>> easily
>> >> >>>> write
>> >> >>>>>>> functions that "feel" like native functions in whatever language
>> >> and
>> >> >>>> be
>> >> >>>>>>> able to generate arrow / vectorized versions of them is quite
>> >> >>>> valuable.
>> >> >>>>>>> This is my understanding of what this proposal is about.
>> >> >>>>>>
>> >> >>>>>> My understanding is that it's not vectorized. From the examples
>> in
>> >> >>>>>> risingwavelabs/arrow-udf, <
>> >> https://github.com/risingwavelabs/arrow-udf>
>> >> >>>> it
>> >> >>>>>> looks like the macros generate code that gathers values from
>> columns
>> >> >>>> into
>> >> >>>>>> local scalars that are passed as scalar parameters to user
>> >> functions.
>> >> >>>> Is
>> >> >>>>>> the hope here that rustc/llvm will auto-vectorize the code?
>> >> >>>>>>
>> >> >>>>>> #[function("gcd(int, int) -> int")]
>> >> >>>>>> fn gcd(mut a: i32, mut b: i32) -> i32 {
>> >> >>>>>>      while b != 0 {
>> >> >>>>>>          (a, b) = (b, a % b);
>> >> >>>>>>      }
>> >> >>>>>>      a
>> >> >>>>>> }
>> >> >>>>>>
>> >> >>>>>> #[function("div(int, int) -> int")]
>> >> >>>>>> fn div(x: i32, y: i32) -> Result<i32, &'static str> {
>> >> >>>>>>      if y == 0 {
>> >> >>>>>>          return Err("division by zero");
>> >> >>>>>>      }
>> >> >>>>>>      Ok(x / y)
>> >> >>>>>> }
>> >> >>>>>>
>> >> >>>>>>> I left some additional comments on the markdown.
>> >> >>>>>>>
>> >> >>>>>>> One thing that might be worth doing is articulate some other
>> >> >>>> potential
>> >> >>>>>>> locations for where the code might go. One option, as I think
>> you
>> >> >>>> propose,
>> >> >>>>>>> is to make its own repository.  Another option could be to
>> donate
>> >> >>>> the code
>> >> >>>>>>> and put the various language bindings in the same repo as the
>> arrow
>> >> >>>>>>> language implementations (e.g arrow-rs, arrow for python, etc)
>> >> which
>> >> >>>> would
>> >> >>>>>>> likely make it easier to maintain and discover.
>> >> >>>>>>>
>> >> >>>>>>> I am curious about what other devs / users feel about this?
>> >> >>>>>>>
>> >> >>>>>>> Andrew
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> On Thu, Jun 20, 2024 at 3:04 AM Xuanwo <[email protected]>
>> wrote:
>> >> >>>>>>>
>> >> >>>>>>>> Hello, everyone.
>> >> >>>>>>>>
>> >> >>>>>>>> I am starting this thread to discuss the donation of a User-Defined
>> >> >>>>>>>> Function Framework for Apache Arrow.
>> >> >>>>>>>>
>> >> >>>>>>>> Feel free to review and leave your comments here. For live
>> review,
>> >> >>>>>> please
>> >> >>>>>>>> visit:
>> >> >>>>>>>>
>> >> >>>>>>>> https://hackmd.io/@xuanwo/apache-arrow-udf
>> >> >>>>>>>>
>> >> >>>>>>>> The original content also pasted here for a quick reading:
>> >> >>>>>>>>
>> >> >>>>>>>> ------
>> >> >>>>>>>>
>> >> >>>>>>>> ## Abstract
>> >> >>>>>>>>
>> >> >>>>>>>> Arrow UDF is a User-Defined Function Framework for Apache
>> Arrow.
>> >> >>>>>>>>
>> >> >>>>>>>> ## Proposal
>> >> >>>>>>>>
>> >> >>>>>>>> Arrow UDF allows users to easily create and run user-defined functions
>> >> >>>>>>>> (UDFs) in Rust, Python, Java, or JavaScript based on Apache Arrow. The
>> >> >>>>>>>> functions can be executed natively, in WebAssembly, or on a remote
>> >> >>>>>>>> server via Arrow Flight.
>> >> >>>>>>>>
>> >> >>>>>>>> Arrow UDF was originally designed to be used by the RisingWave
>> >> >>>> project
>> >> >>>>>> but
>> >> >>>>>>>> is now being used by Databend and several database startups.
>> >> >>>>>>>>
>> >> >>>>>>>> We believe that the Arrow UDF project will bring diversity and value to
>> >> >>>>>>>> the entire Arrow community.
>> >> >>>>>>>>
>> >> >>>>>>>> ## Background
>> >> >>>>>>>>
>> >> >>>>>>>> Arrow UDF has been developed by an open-source community since day one
>> >> >>>>>>>> and is owned by RisingWaveLabs. The project was launched in December
>> >> >>>>>>>> 2023.
>> >> >>>>>>>>
>> >> >>>>>>>> ## Initial Goals
>> >> >>>>>>>>
>> >> >>>>>>>> By transferring ownership of the project to Apache Arrow, Arrow UDF
>> >> >>>>>>>> expects to ensure its neutrality and further encourage and facilitate
>> >> >>>>>>>> the adoption of Arrow UDF by the community.
>> >> >>>>>>>>
>> >> >>>>>>>> ## Current Status
>> >> >>>>>>>>
>> >> >>>>>>>> Contributors: 5
>> >> >>>>>>>>
>> >> >>>>>>>> Users:
>> >> >>>>>>>>
>> >> >>>>>>>> -   [RisingWave]: A Distributed SQL Database for Stream
>> >> Processing.
>> >> >>>>>>>> -   [Databend]: An open-source cloud data warehouse that
>> serves as
>> >> >>>> a
>> >> >>>>>>>> cost-effective alternative to Snowflake.
>> >> >>>>>>>>
>> >> >>>>>>>> ## Documentation
>> >> >>>>>>>>
>> >> >>>>>>>> The documentation for Arrow UDF is hosted at
>> >> >>>>>>>> https://docs.rs/arrow-udf/latest/arrow_udf/.
>> >> >>>>>>>>
>> >> >>>>>>>> ## Initial Source
>> >> >>>>>>>>
>> >> >>>>>>>> The project currently holds a GitHub repository and multiple
>> >> >>>> packages:
>> >> >>>>>>>>
>> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
>> >> >>>>>>>>
>> >> >>>>>>>> Rust:
>> >> >>>>>>>>
>> >> >>>>>>>> - https://crates.io/arrow-udf/
>> >> >>>>>>>> - https://crates.io/arrow-udf-python/
>> >> >>>>>>>> - https://crates.io/arrow-udf-js/
>> >> >>>>>>>> - https://crates.io/arrow-udf-js-deno/
>> >> >>>>>>>> - https://crates.io/arrow-udf-wasm/
>> >> >>>>>>>>
>> >> >>>>>>>> Python:
>> >> >>>>>>>>
>> >> >>>>>>>> - https://pypi.org/project/arrow-udf/
>> >> >>>>>>>>
>> >> >>>>>>>> Those packages will retain their names, while the repository will be
>> >> >>>>>>>> moved to the apache org.
>> >> >>>>>>>>
>> >> >>>>>>>> ## Required Resources
>> >> >>>>>>>>
>> >> >>>>>>>> ### Mailing Lists
>> >> >>>>>>>>
>> >> >>>>>>>> We can reuse the existing mailing lists that Arrow has.
>> >> >>>>>>>>
>> >> >>>>>>>> ### Git Repositories
>> >> >>>>>>>>
>> >> >>>>>>>> From
>> >> >>>>>>>>
>> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
>> >> >>>>>>>>
>> >> >>>>>>>> To
>> >> >>>>>>>>
>> >> >>>>>>>> - https://gitbox.apache.org/asf/repos/arrow-udf
>> >> >>>>>>>> - https://github.com/apache/arrow-udf
>> >> >>>>>>>>
>> >> >>>>>>>> ### Issue Tracking
>> >> >>>>>>>>
>> >> >>>>>>>> The project would like to continue using GitHub Issues.
>> >> >>>>>>>>
>> >> >>>>>>>> ### Other Resources
>> >> >>>>>>>>
>> >> >>>>>>>> The project has already chosen GitHub Actions as its continuous
>> >> >>>>>>>> integration tool.
>> >> >>>>>>>>
>> >> >>>>>>>> ## Initial Committers
>> >> >>>>>>>>
>> >> >>>>>>>> - Runji Wang [email protected]
>> >> >>>>>>>> - Giovanny Gutiérrez
>> >> >>>>>>>> - sundy-li [email protected]
>> >> >>>>>>>> - Xuanwo [email protected]
>> >> >>>>>>>> - Max Justus Spransy [email protected]
>> >> >>>>>>>>
>> >> >>>>>>>> [RisingWave]: https://github.com/risingwavelabs/risingwave
>> >> >>>>>>>> [Databend]: https://github.com/datafuselabs/databend
>> >> >>>>>>>>
>> >> >>>>>>>> Xuanwo
>> >> >>>>>>>>
>> >> >>>>>>
>> >> >>>
>> >> >>>
>> >> >>
>> >>
>> >> --
>> >> Xuanwo
>> >>
>>
>> --
>> Xuanwo
>>

-- 
Xuanwo

https://xuanwo.io/
