Re: [DISCUSS] Donation of a Spark native engine based on DataFusion & Arrow

Andy Grove Wed, 10 Jan 2024 14:10:51 -0800

Hi Chao,

This sounds like a really interesting project. I am interested in seeing
how it compares to Spark RAPIDS (the project that I work on at NVIDIA) and
Intel's Gluten project (that works with Velox).


I can see the following benefits of having this project under Apache
Arrow governance:

- Assuming that this is a drop-in replacement that doesn't require users to
change their code (as I imagine is the case), then it could lead to greater
adoption of DataFusion, especially for more demanding use cases where
processing on a single node is not possible.
- Given that it has a deep integration with the Rust implementation of
Arrow as well as DataFusion, and given the overlap of committers between
these projects, having them under the same governance and communication
channels will generally be more efficient than if this project is separate.
- Hopefully this leads to more upstream contributions to DataFusion,
perhaps even allowing other projects such as Ballista to benefit from
Spark-compatible operators and expressions in the future.
- Having another project that uses DataFusion as a dependency could help
with stabilizing the public APIs and generally driving more innovation.

Given these points, I would be supportive of a donation. I see it as being
similar to the Ballista project, which is already part of Arrow (and which
we plan to move along with DataFusion once it becomes a top-level project).

Thanks,

Andy.

On Wed, Jan 10, 2024 at 2:28 PM Chao Sun <sunc...@apache.org> wrote:

> Hi all,
>
> We have been working on a native execution engine for Apache Spark
> that is heavily based on DataFusion and Arrow. Our goal is to
> accelerate Spark query execution by delegating Spark's physical plan
> execution to DataFusion's highly modular execution framework, while
> still maintaining the same semantics to Spark users (i.e., no Spark
> behavior change from the end users' point of view). Several of us are
> Spark and/or Arrow committers. At the moment, the project is under
> active development and not yet feature complete. However, some of the
> existing functionalities are relatively mature and have been put in
> production for a while now.
>
> Given the current momentum towards accelerating Spark through native
> vectorized execution, we believe open sourcing this work will benefit
> other Spark users too. In addition, we think the project itself can
> also leverage the vibrant and strong community behind Arrow and
> DataFusion, and grow faster. Because of this, we are exploring the
> possibility of contributing this project to the Apache Software
> Foundation (ASF) under the Apache Arrow project umbrella.
>
> We'd very much like to hear your opinion on this. Thanks.
>
> Best,
> Chao
>
