Hi Chao,

This sounds like a really interesting project. I am interested in seeing how it compares to Spark RAPIDS (the project that I work on at NVIDIA) and Intel's Gluten project (which works with Velox).

I can see the following benefits of having this project under Apache Arrow governance:

- Assuming that this is a drop-in replacement that doesn't require users to change their code (as I imagine is the case), it could lead to greater adoption of DataFusion, especially for more demanding use cases where processing on a single node is not possible.
- Given that it has a deep integration with the Rust implementation of Arrow as well as DataFusion, and given the overlap of committers between these projects, having them under the same governance and communication channels will generally be more efficient than if this project is separate.
- Hopefully this leads to more upstream contributions to DataFusion, perhaps even allowing other projects such as Ballista to benefit from Spark-compatible operators and expressions in the future.
- Having another project that uses DataFusion as a dependency could help with stabilizing the public APIs and generally driving more innovation.

Given these points, I would be supportive of a donation. I see it as being similar to the Ballista project, which is already part of Arrow (and which we plan to move along with DataFusion once it becomes a top-level project).

Thanks,

Andy.

On Wed, Jan 10, 2024 at 2:28 PM Chao Sun <sunc...@apache.org> wrote:

> Hi all,
>
> We have been working on a native execution engine for Apache Spark
> that is heavily based on DataFusion and Arrow. Our goal is to
> accelerate Spark query execution via delegating Spark's physical plan
> execution to DataFusion's highly modular execution framework, while
> still maintaining the same semantics to Spark users (i.e., no Spark
> behavior change from the end users' point of view). Several of us are
> Spark and/or Arrow committers. At the moment, the project is under
> active development and not yet feature complete. However, some of the
> existing functionalities are relatively mature and have been put in
> production for a while now.
>
> Given the current momentum towards accelerating Spark through native
> vectorized execution, we believe open sourcing this work will benefit
> other Spark users too. In addition, we think the project itself can
> also leverage the vibrant and strong community behind Arrow and
> DataFusion, and grow faster. Because of this, we are exploring the
> possibility of contributing this project to the Apache Software
> Foundation (ASF) under the Apache Arrow project umbrella.
>
> We'd very much like to hear your opinion on this. Thanks.
>
> Best,
> Chao
>