Re: [DISCUSS] Donation of a Spark native engine based on DataFusion & Arrow

Chao Sun Wed, 10 Jan 2024 14:04:17 -0800

Thanks Micah for the quick response.

> Would Spark itself not be a reasonable place for this work?


We considered Spark as well but decided that Arrow is a better home for
it, given that the project itself is heavily tied to DataFusion. A lot
of the work in this project is converting Spark's physical plan into
DataFusion's physical plan, with custom overrides where DF's semantics
differ from what Spark offers.
We also work closely with the DataFusion (as well as arrow-rs)
community to make DataFusion more modular and customizable to fit our needs.

> Do you anticipate this would move with DataFusion to its own top-level 
> project [1] if that happens or stay within the Arrow project?

Yes, if the donation is successful, we do anticipate that this project
can move along with DataFusion and become a sub-project under it. We
would still like to see if we can donate it to Arrow first, given that
it may take months for DataFusion to become a top-level project itself.

Chao

On Wed, Jan 10, 2024 at 1:45 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Chao,
> Very cool. I think this is something that a lot of people are interested
> in.  I think the main questions I have are:
> 1.  Would Spark itself not be a reasonable place for this work?
> 2.  Do you anticipate this would move with DataFusion to its own top-level
> project [1] if that happens or stay within the Arrow project?
>
> Thanks,
> Micah
>
> [1] https://lists.apache.org/thread/c150t1s1x0kcb3r03cjyx31kqs5oc341
>
> On Wed, Jan 10, 2024 at 1:28 PM Chao Sun <sunc...@apache.org> wrote:
>
> > Hi all,
> >
> > We have been working on a native execution engine for Apache Spark
> > that is heavily based on DataFusion and Arrow. Our goal is to
> > accelerate Spark query execution by delegating Spark's physical plan
> > execution to DataFusion's highly modular execution framework, while
> > still maintaining the same semantics for Spark users (i.e., no Spark
> > behavior change from the end users' point of view). Several of us are
> > Spark and/or Arrow committers. At the moment, the project is under
> > active development and not yet feature complete. However, some of the
> > existing functionalities are relatively mature and have been put in
> > production for a while now.
> >
> > Given the current momentum towards accelerating Spark through native
> > vectorized execution, we believe open sourcing this work will benefit
> > other Spark users too. In addition, we think the project itself can
> > also leverage the vibrant and strong community behind Arrow and
> > DataFusion, and grow faster. Because of this, we are exploring the
> > possibility of contributing this project to the Apache Software
> > Foundation (ASF) under the Apache Arrow project umbrella.
> >
> > We'd very much like to hear your opinion on this. Thanks.
> >
> > Best,
> > Chao
> >
