Full disclosure: I worked on the original value vector implementation that became Apache Arrow, and I currently work with Chao et al. on the native engine under discussion. I believe that integrating DataFusion with Spark will drive both development and user interest in arrow-rs and DataFusion. Personally, I find the idea of this contribution driving increased interest in DataFusion and Ballista very exciting. Given projects like Gluten, blaze-rs, and of course RAPIDS, it would be great to get the community involved in benchmarking and comparing the various implementations.

On Thu, Jan 11, 2024 at 5:51 AM Andrew Lamb <al...@influxdata.com> wrote:

> I am very supportive of this donation. I know of at least one other
> DataFusion-based project, blaze-rs[1], which has the same design goal and
> bringing this project into the ASF may help consolidate these efforts
>
> As Andy said, I believe it was very valuable to have a major consumer
> project (e.g. DataFusion) to help drive the definition and implementation
> of arrow-rs implementation. We never achieved the same synergy with
> Ballista and DataFusion but I think it is more likely with a more actively
> maintained Spark accelerator.
>
> I am not sure it affects this discussion, but the Gluten project, based on
> Velox, was accepted yesterday[2] into the Apache Incubator[2]. While the
> functionality may be similar, the technology (Rust vs C/C++) and the
> communities are different so having both in the same (big) tent of the ASF
> doesn't seem concerning to me.
>
> Also, as Chao says, I think this new sub project would naturally move to a
> new DataFusion top level project when we get there (we plan a proposed
> resolution April ASF board meeting)
>
> Looking forward to seeing more!
> Andrew
>
> [1]: https://github.com/blaze-init/blaze
> [2]: https://lists.apache.org/thread/6lrozds10jn9gknj9rf74lqbh7j55pq6
>
> On Wed, Jan 10, 2024 at 5:10 PM Andy Grove <andygrov...@gmail.com> wrote:
>
> > Hi Chao,
> >
> > This sounds like a really interesting project. I am interested in seeing
> > how it compares to Spark RAPIDS (the project that I work on at NVIDIA)
> > and Intel's Gluten project (that works with Velox).
> >
> > I can see the following benefits of having this project being under
> > Apache Arrow governance:
> >
> > - Assuming that this is a drop-in replacement that doesn't require users
> >   to change their code (as I imagine is the case), then it could lead to
> >   greater adoption of DataFusion, especially for more demanding use cases
> >   where processing on a single node is not possible.
> > - Given that it has a deep integration with the Rust implementation of
> >   Arrow as well as DataFusion, and given the overlap of committers between
> >   these projects, having them under the same governance and communication
> >   channels will generally be more efficient than if this project is
> >   separate.
> > - Hopefully this leads to more upstream contributions to DataFusion,
> >   perhaps even allowing other projects such as Ballista to benefit from
> >   Spark-compatible operators and expressions in the future.
> > - Having another project that uses DataFusion as a dependency could help
> >   with stabilizing the public APIs and generally driving more innovation.
> >
> > Given these points, I would be supportive of a donation. I see it as
> > being similar to the Ballista project, which is already part of Arrow
> > (and we plan to move along with DataFusion once it becomes a top-level
> > project).
> >
> > Thanks,
> >
> > Andy.
> >
> > On Wed, Jan 10, 2024 at 2:28 PM Chao Sun <sunc...@apache.org> wrote:
> >
> > > Hi all,
> > >
> > > We have been working on a native execution engine for Apache Spark
> > > that is heavily based on DataFusion and Arrow. Our goal is to
> > > accelerate Spark query execution via delegating Spark's physical plan
> > > execution to DataFusion's highly modular execution framework, while
> > > still maintaining the same semantics to Spark users (i.e., no Spark
> > > behavior change from the end users' point of view). Several of us are
> > > Spark and/or Arrow committers. At the moment, the project is under
> > > active development and not yet feature complete. However, some of the
> > > existing functionalities are relatively mature and have been put in
> > > production for a while now.
> > >
> > > Given the current momentum towards accelerating Spark through native
> > > vectorized execution, we believe open sourcing this work will benefit
> > > other Spark users too. In addition, we think the project itself can
> > > also leverage the vibrant and strong community behind Arrow and
> > > DataFusion, and grow faster. Because of this, we are exploring the
> > > possibility of contributing this project to the Apache Software
> > > Foundation (ASF) under the Apache Arrow project umbrella.
> > >
> > > We'd very much like to hear your opinion on this. Thanks.
> > >
> > > Best,
> > > Chao