Re: [DISCUSS] Donation of a Spark native engine based on DataFusion & Arrow

Andy Grove Mon, 15 Jan 2024 09:37:55 -0800

Hi Chao,

I have created https://github.com/apache/arrow-datafusion-comet and you
should be able to create a PR against the repo.


Thanks,

Andy.

On Fri, Jan 12, 2024 at 3:45 PM Chao Sun <[email protected]> wrote:

> Thanks all for the positive support!
>
> Andy, we plan to name the project Comet (BTW if you have better
> suggestions please let us know). Could you help to create a repo named
> arrow-datafusion-comet or arrow-comet? We'll clean up our internal
> repo and prepare for the donation in the next few days. Thanks for the
> help!
>
> Best,
> Chao
>
>
>
> On Fri, Jan 12, 2024 at 7:09 AM Andy Grove <[email protected]> wrote:
> >
> > I think the next step here would be to create a new repo so that Chao can
> > create a PR for the contribution, and then we can proceed to a vote.
> >
> > Chao - do you have a proposal for the name of the project? Given that this
> > is being donated to Apache Arrow, the repo name will start with "arrow-".
> > Also, given that this is more of a DataFusion sub-project, I think it would
> > make sense to prefix the repo name with "arrow-datafusion-" and then rename
> > to "datafusion-" once we move the DataFusion projects to the new top-level
> > project.
> >
> > If the vote passes, we must complete the IP clearance process before the PR
> > is accepted [1].
> >
> > [1] https://incubator.apache.org/ip-clearance/
> >
> >
> >
> > On Fri, Jan 12, 2024 at 12:36 AM Albert <[email protected]> wrote:
> >
> > > As Andrew Lamb mentioned, blaze-rs has similar goals. I'd really be
> > > interested to see some comparisons once the donation is made.
> > > All in all, I look forward to the new native project for Spark
> > > acceleration.
> > >
> > > On Thu, Jan 11, 2024 at 9:50 PM Andrew Lamb <[email protected]> wrote:
> > >
> > > > I am very supportive of this donation. I know of at least one other
> > > > DataFusion-based project, blaze-rs[1], which has the same design goal, and
> > > > bringing this project into the ASF may help consolidate these efforts.
> > > >
> > > > As Andy said, I believe it was very valuable to have a major consumer
> > > > project (e.g. DataFusion) to help drive the definition and implementation
> > > > of arrow-rs. We never achieved the same synergy with Ballista and
> > > > DataFusion, but I think it is more likely with a more actively
> > > > maintained Spark accelerator.
> > > >
> > > > I am not sure it affects this discussion, but the Gluten project, based on
> > > > Velox, was accepted yesterday into the Apache Incubator[2]. While the
> > > > functionality may be similar, the technology (Rust vs C/C++) and the
> > > > communities are different, so having both in the same (big) tent of the ASF
> > > > doesn't seem concerning to me.
> > > >
> > > > Also, as Chao says, I think this new sub project would naturally move to a
> > > > new DataFusion top level project when we get there (we plan a proposed
> > > > resolution for the April ASF board meeting).
> > > >
> > > > Looking forward to seeing more!
> > > > Andrew
> > > >
> > > > [1]: https://github.com/blaze-init/blaze
> > > > [2]: https://lists.apache.org/thread/6lrozds10jn9gknj9rf74lqbh7j55pq6
> > > >
> > > > On Wed, Jan 10, 2024 at 5:10 PM Andy Grove <[email protected]> wrote:
> > > >
> > > > > Hi Chao,
> > > > >
> > > > > This sounds like a really interesting project. I am interested in seeing
> > > > > how it compares to Spark RAPIDS (the project that I work on at NVIDIA) and
> > > > > Intel's Gluten project (that works with Velox).
> > > > >
> > > > > I can see the following benefits of having this project under Apache
> > > > > Arrow governance:
> > > > >
> > > > > - Assuming that this is a drop-in replacement that doesn't require users to
> > > > > change their code (as I imagine is the case), then it could lead to greater
> > > > > adoption of DataFusion, especially for more demanding use cases where
> > > > > processing on a single node is not possible.
> > > > > - Given that it has a deep integration with the Rust implementation of
> > > > > Arrow as well as DataFusion, and given the overlap of committers between
> > > > > these projects, having them under the same governance and communication
> > > > > channels will generally be more efficient than if this project is separate.
> > > > > - Hopefully this leads to more upstream contributions to DataFusion,
> > > > > perhaps even allowing other projects such as Ballista to benefit from
> > > > > Spark-compatible operators and expressions in the future.
> > > > > - Having another project that uses DataFusion as a dependency could help
> > > > > with stabilizing the public APIs and generally driving more innovation.
> > > > >
> > > > > Given these points, I would be supportive of a donation. I see it as being
> > > > > similar to the Ballista project, which is already part of Arrow (and which
> > > > > we plan to move along with DataFusion once it becomes a top-level project).
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > > >
> > > > > On Wed, Jan 10, 2024 at 2:28 PM Chao Sun <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > We have been working on a native execution engine for Apache Spark
> > > > > > that is heavily based on DataFusion and Arrow. Our goal is to
> > > > > > accelerate Spark query execution by delegating Spark's physical plan
> > > > > > execution to DataFusion's highly modular execution framework, while
> > > > > > still maintaining the same semantics for Spark users (i.e., no Spark
> > > > > > behavior change from the end users' point of view). Several of us are
> > > > > > Spark and/or Arrow committers. At the moment, the project is under
> > > > > > active development and not yet feature complete. However, some of the
> > > > > > existing functionality is relatively mature and has been in
> > > > > > production for a while now.
> > > > > >
> > > > > > Given the current momentum towards accelerating Spark through native
> > > > > > vectorized execution, we believe open sourcing this work will benefit
> > > > > > other Spark users too. In addition, we think the project itself can
> > > > > > also leverage the vibrant and strong community behind Arrow and
> > > > > > DataFusion, and grow faster. Because of this, we are exploring the
> > > > > > possibility of contributing this project to the Apache Software
> > > > > > Foundation (ASF) under the Apache Arrow project umbrella.
> > > > > >
> > > > > > We'd very much like to hear your opinion on this. Thanks.
> > > > > >
> > > > > > Best,
> > > > > > Chao
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > ~~~~~~~~~~~~~~~
> > > no mistakes
> > > ~~~~~~~~~~~~~~~~~~
> > >
>
