Re: [DISCUSS] Donation of a Spark native engine based on DataFusion & Arrow

Micah Kornfield Thu, 11 Jan 2024 22:05:19 -0800

It sounds like there is likely enough support for this to move forward. I'd
guess the next steps are to work on the donation process/vote. Probably
someone more involved with DataFusion should help drive this effort?


On Thu, Jan 11, 2024 at 12:55 PM L. C. Hsieh <vii...@gmail.com> wrote:

> Spark, as a widely used computation engine in industry, has momentum
> from developers and users.
>
> I believe that the integration with DataFusion can not only help
> drive Spark to the next level of performance with a new native
> execution engine, but also attract more developer attention to the
> development of DataFusion.
>
> Although it serves Spark developers and users from a high-level point
> of view, its underlying technologies are for the most part tightly
> coupled with the DataFusion and arrow-rs projects. This makes it a
> natural fit for the same governance and development efforts as
> DataFusion, which is also a more efficient way to achieve seamless
> communication between them.
>
> In the near future, I'm looking forward to seeing more contributions
> and development interaction in DataFusion and arrow-rs as a result of
> the donation of this project.
>
> So I'm supportive of this donation.
>
> Disclosure: I'm currently working on this project along with Chao and
> Parth.
>
> On Thu, Jan 11, 2024 at 9:33 AM Parth Chandra <par...@apache.org> wrote:
> >
> > Full disclosure: I worked on the original value vector implementation
> > that became Apache Arrow and currently work with Chao et al. on the
> > native engine that is being discussed.
> > I believe that integration of DataFusion with Spark will drive both
> > development and user interest in arrow-rs and DataFusion. Personally, I
> > find the idea of this contribution driving increased interest in
> > DataFusion and Ballista very exciting.
> > Given projects like Gluten, blaze-rs, and of course RAPIDS, it would be
> > great to get the community involved in benchmarking and comparing the
> > various implementations.
> >
> > On Thu, Jan 11, 2024 at 5:51 AM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > > I am very supportive of this donation. I know of at least one other
> > > DataFusion-based project, blaze-rs[1], which has the same design goal,
> > > and bringing this project into the ASF may help consolidate these
> > > efforts.
> > >
> > > As Andy said, I believe it was very valuable to have a major consumer
> > > project (e.g. DataFusion) to help drive the definition and
> > > implementation of arrow-rs. We never achieved the same synergy with
> > > Ballista and DataFusion, but I think it is more likely with a more
> > > actively maintained Spark accelerator.
> > >
> > > I am not sure it affects this discussion, but the Gluten project,
> > > based on Velox, was accepted yesterday into the Apache Incubator[2].
> > > While the functionality may be similar, the technology (Rust vs C/C++)
> > > and the communities are different, so having both in the same (big)
> > > tent of the ASF doesn't seem concerning to me.
> > >
> > > Also, as Chao says, I think this new subproject would naturally move
> > > to a new DataFusion top-level project when we get there (we plan a
> > > proposed resolution for the April ASF board meeting).
> > >
> > > Looking forward to seeing more!
> > > Andrew
> > >
> > > [1]: https://github.com/blaze-init/blaze
> > > [2]: https://lists.apache.org/thread/6lrozds10jn9gknj9rf74lqbh7j55pq6
> > >
> > > On Wed, Jan 10, 2024 at 5:10 PM Andy Grove <andygrov...@gmail.com> wrote:
> > >
> > > > Hi Chao,
> > > >
> > > > This sounds like a really interesting project. I am interested in
> > > > seeing how it compares to Spark RAPIDS (the project that I work on at
> > > > NVIDIA) and Intel's Gluten project (that works with Velox).
> > > >
> > > > I can see the following benefits of having this project under
> > > > Apache Arrow governance:
> > > >
> > > > - Assuming that this is a drop-in replacement that doesn't require
> > > > users to change their code (as I imagine is the case), then it could
> > > > lead to greater adoption of DataFusion, especially for more demanding
> > > > use cases where processing on a single node is not possible.
> > > > - Given that it has a deep integration with the Rust implementation
> > > > of Arrow as well as DataFusion, and given the overlap of committers
> > > > between these projects, having them under the same governance and
> > > > communication channels will generally be more efficient than if this
> > > > project is separate.
> > > > - Hopefully this leads to more upstream contributions to DataFusion,
> > > > perhaps even allowing other projects such as Ballista to benefit from
> > > > Spark-compatible operators and expressions in the future.
> > > > - Having another project that uses DataFusion as a dependency could
> > > > help with stabilizing the public APIs and generally driving more
> > > > innovation.
> > > >
> > > > Given these points, I would be supportive of a donation. I see it as
> > > > being similar to the Ballista project, which is already part of Arrow
> > > > (and which we plan to move along with DataFusion once it becomes a
> > > > top-level project).
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > > >
> > > > On Wed, Jan 10, 2024 at 2:28 PM Chao Sun <sunc...@apache.org> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We have been working on a native execution engine for Apache Spark
> > > > > that is heavily based on DataFusion and Arrow. Our goal is to
> > > > > accelerate Spark query execution by delegating Spark's physical
> > > > > plan execution to DataFusion's highly modular execution framework,
> > > > > while still maintaining the same semantics for Spark users (i.e., no
> > > > > Spark behavior change from the end users' point of view). Several of
> > > > > us are Spark and/or Arrow committers. At the moment, the project is
> > > > > under active development and not yet feature complete. However, some
> > > > > of the existing functionality is relatively mature and has been in
> > > > > production for a while now.
> > > > >
> > > > > Given the current momentum towards accelerating Spark through
> > > > > native vectorized execution, we believe open sourcing this work
> > > > > will benefit other Spark users too. In addition, we think the
> > > > > project itself can also leverage the vibrant and strong community
> > > > > behind Arrow and DataFusion, and grow faster. Because of this, we
> > > > > are exploring the possibility of contributing this project to the
> > > > > Apache Software Foundation (ASF) under the Apache Arrow project
> > > > > umbrella.
> > > > >
> > > > > We'd very much like to hear your opinion on this. Thanks.
> > > > >
> > > > > Best,
> > > > > Chao
> > > > >
> > > >
> > >
>
