Hi Chao, I have created https://github.com/apache/arrow-datafusion-comet and you should be able to create a PR against the repo.
Thanks,

Andy.

On Fri, Jan 12, 2024 at 3:45 PM Chao Sun <sunc...@apache.org> wrote:

> Thanks all for the positive support!
>
> Andy, we plan to name the project Comet (BTW if you have better suggestions please let us know). Could you help to create a repo named arrow-datafusion-comet or arrow-comet? We'll clean up our internal repo and prepare for the donation in the next few days. Thanks for the help!
>
> Best,
> Chao
>
> On Fri, Jan 12, 2024 at 7:09 AM Andy Grove <andygrov...@gmail.com> wrote:
> >
> > I think the next step here would be to create a new repo so that Chao can create a PR for the contribution, and then we can proceed to a vote.
> >
> > Chao - do you have a proposal for the name of the project? Given that this is being donated to Apache Arrow, the repo name will start with "arrow-". Also, given that this is more of a DataFusion sub-project, I think it would make sense to prefix the repo name with "arrow-datafusion-" and then rename to "datafusion-" once we move the DataFusion projects to the new top-level project.
> >
> > If the vote passes, we must complete the IP clearance process before the PR is accepted [1].
> >
> > [1] https://incubator.apache.org/ip-clearance/
> >
> > On Fri, Jan 12, 2024 at 12:36 AM Albert <zinki...@gmail.com> wrote:
> >
> > > Like Andrew Lamb mentioned, blaze-rs has similar goals, I'd really be interested to know some comparisons when the donations are made.
> > > All in all, I look forward to the new native project for spark acceleration.
> > >
> > > On Thu, Jan 11, 2024 at 9:50 PM Andrew Lamb <al...@influxdata.com> wrote:
> > >
> > > > I am very supportive of this donation. I know of at least one other DataFusion-based project, blaze-rs[1], which has the same design goal and bringing this project into the ASF may help consolidate these efforts
> > > >
> > > > As Andy said, I believe it was very valuable to have a major consumer project (e.g. DataFusion) to help drive the definition and implementation of arrow-rs implementation. We never achieved the same synergy with Ballista and DataFusion but I think it is more likely with a more actively maintained Spark accelerator.
> > > >
> > > > I am not sure it affects this discussion, but the Gluten project, based on Velox, was accepted yesterday[2] into the Apache Incubator[2]. While the functionality may be similar, the technology (Rust vs C/C++) and the communities are different so having both in the same (big) tent of the ASF doesn't seem concerning to me.
> > > >
> > > > Also, as Chao says, I think this new sub project would naturally move to a new DataFusion top level project when we get there (we plan a proposed resolution April ASF board meeting)
> > > >
> > > > Looking forward to seeing more!
> > > > Andrew
> > > >
> > > > [1]: https://github.com/blaze-init/blaze
> > > > [2]: https://lists.apache.org/thread/6lrozds10jn9gknj9rf74lqbh7j55pq6
> > > >
> > > > On Wed, Jan 10, 2024 at 5:10 PM Andy Grove <andygrov...@gmail.com> wrote:
> > > >
> > > > > Hi Chao,
> > > > >
> > > > > This sounds like a really interesting project. I am interested in seeing how it compares to Spark RAPIDS (the project that I work on at NVIDIA) and Intel's Gluten project (that works with Velox).
> > > > >
> > > > > I can see the following benefits of having this project being under Apache Arrow governance:
> > > > >
> > > > > - Assuming that this is a drop-in replacement that doesn't require users to change their code (as I imagine is the case), then it could lead to greater adoption of DataFusion, especially for more demanding use cases where processing on a single node is not possible.
> > > > > - Given that it has a deep integration with the Rust implementation of Arrow as well as DataFusion, and given the overlap of committers between these projects, having them under the same governance and communication channels will generally be more efficient than if this project is separate.
> > > > > - Hopefully this leads to more upstream contributions to DataFusion, perhaps even allowing other projects such as Ballista to benefit from Spark-compatible operators and expressions in the future.
> > > > > - Having another project that uses DataFusion as a dependency could help with stabilizing the public APIs and generally driving more innovation.
> > > > >
> > > > > Given these points, I would be supportive of a donation. I see it as being similar to the Ballista project, which is already part of Arrow (and we plan to move along with DataFusion once it becomes a top-level project).
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > > >
> > > > > On Wed, Jan 10, 2024 at 2:28 PM Chao Sun <sunc...@apache.org> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > We have been working on a native execution engine for Apache Spark that is heavily based on DataFusion and Arrow. Our goal is to accelerate Spark query execution via delegating Spark's physical plan execution to DataFusion's highly modular execution framework, while still maintaining the same semantics to Spark users (i.e., no Spark behavior change from the end users' point of view). Several of us are Spark and/or Arrow committers. At the moment, the project is under active development and not yet feature complete. However, some of the existing functionalities are relatively mature and have been put in production for a while now.
> > > > > >
> > > > > > Given the current momentum towards accelerating Spark through native vectorized execution, we believe open sourcing this work will benefit other Spark users too. In addition, we think the project itself can also leverage the vibrant and strong community behind Arrow and DataFusion, and grow faster. Because of this, we are exploring the possibility of contributing this project to the Apache Software Foundation (ASF) under the Apache Arrow project umbrella.
> > > > > >
> > > > > > We'd very much like to hear your opinion on this. Thanks.
> > > > > >
> > > > > > Best,
> > > > > > Chao
> > >
> > > --
> > > ~~~~~~~~~~~~~~~
> > > no mistakes
> > > ~~~~~~~~~~~~~~~~~~