For those that are interested wrt lang types/lines...

--------------------------------------------------------------------------------
Language                      files          blank        comment           code
--------------------------------------------------------------------------------
Rust                             69           2701           2548          17154
Scala                            69           2098           2595          12991
Java                             41            926           1521           5505
Maven                             4             71            156           1228
Protocol Buffers                  3             96             65            417
XML                               3             80             99            256
Markdown                          5             69             80            190
TOML                              2             14             38             90
Bourne Shell                      1              9             39             65
make                              1              5              1             62
Bourne Again Shell                1             12             16             56
YAML                              2              5             38             34
Properties                        2              8             42             26
SQL                               1              0              0              9
--------------------------------------------------------------------------------
SUM:                            204           6094           7238          38083
--------------------------------------------------------------------------------
On Wed, Jan 24, 2024 at 8:30 AM Chao Sun <sunc...@apache.org> wrote:

> Thanks Jacques and everyone here for the feedback! We just created a
> PR https://github.com/apache/arrow-datafusion-comet/pull/1 for the
> donation vote and IP clearance. Please take a look there and provide
> your valuable comments.
>
> Best,
> Chao
>
> On Thu, Jan 18, 2024 at 5:24 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >
> > Yes, that was roughly what I was requesting (I was suggesting a single PR
> > with many commits that would be merged with the history).
> >
> > It's hard to provide a more concrete opinion on this without seeing the
> > quantity and complexity of the code. If it's 5,000 lines of code, it
> > probably doesn't matter. If it's 500,000, it's probably pretty important.
> > If 10 active Arrow/Datafusion committers are already substantial
> > contributors to the code also makes a difference versus only a fairly
> > disjunct collection of people who are relatively inactive Arrow community
> > members.
> >
> > Don't take this as lack of excitement! The potential for contribution is
> > awesome and exciting!
> >
> > Part of making the contribution successful is making it as approachable as
> > possible to the rest of the community. I just want to find every way
> > possible that we can do that.
> >
> > Looking forward to seeing the code.
> >
> > On Wed, Jan 17, 2024 at 10:13 AM Chao Sun <sunc...@apache.org> wrote:
> >
> > > Hi Jacques,
> > >
> > > Do you mean instead of a single PR, we modify (e.g., git commit amend)
> > > all the commits that we have internally to remove any sensitive
> > > information, and open PRs for them against the above repo?
> > >
> > > I understand this will help readability and maintenance of the code,
> > > but it will be a lot of work (we have ~1000 commits) and much more
> > > difficult to pass our legal review (our company has pretty strict
> > > policies in open source and all the commits need to be checked before
> > > they can go outside). In addition, we already carefully added plenty
> > > of comments in the codebase for things that require non-trivial
> > > efforts to understand.
> > >
> > > Given that all of our team members will be actively maintaining and
> > > contributing to this project (since it's being widely used internally
> > > already), we'd be happy to help further improve readability &
> > > maintainability of the codebase and resolving issues raised from the
> > > community. Will this work for you? really appreciate if you understand
> > > our situation.
> > >
> > > Thanks,
> > > Chao
> > >
> > > On Wed, Jan 17, 2024 at 11:30 AM Jacques Nadeau <jacq...@apache.org> wrote:
> > > >
> > > > Thanks for the quick response Chao.
> > > >
> > > > My experience on these things is that maintaining commit history for
> > > > large codebases can be invaluable for tracking down issues. (Hey, why
> > > > is this code written this way-- oh, it was part of x patch that was
> > > > trying to achieve y).
> > > >
> > > > In the past, I've used git commit replay type tools and filtering of
> > > > commit messages, subdirectories, etc. to get something prepped for
> > > > external consumption. My experience is that spending a few days now to
> > > > do this kind of thing saves far more days in the future (and leads to
> > > > higher quality).
> > > >
> > > > On Wed, Jan 17, 2024 at 9:18 AM Chao Sun <sunc...@apache.org> wrote:
> > > >
> > > > > Hi Andy and Jacques,
> > > > >
> > > > > Thanks for setting the repo up. Yes we are working on cleaning up the
> > > > > internal repo and preparing to open a PR in the next few days.
> > > > >
> > > > > It's a bit difficult to retain the original commit history in the PR
> > > > > though since some of them contain internal info which we need to
> > > > > remove upon open sourcing. How about we just add a summary in the PR
> > > > > itself, and add everyone that has contributed to it as co-author to
> > > > > the PR?
> > > > >
> > > > > Chao
> > > > >
> > > > > On Wed, Jan 17, 2024 at 11:09 AM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > > >
> > > > > > Hey Chao, it would be great for you to share the code some place with
> > > > > > commit history. (PR to the repo that Andy made or something else.)
> > > > > >
> > > > > > On Mon, Jan 15, 2024 at 7:38 AM Andy Grove <andygrov...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Chao,
> > > > > > >
> > > > > > > I have created https://github.com/apache/arrow-datafusion-comet and
> > > > > > > you should be able to create a PR against the repo.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Andy.
> > > > > > >
> > > > > > > Andy.
> > > > > > >
> > > > > > > On Fri, Jan 12, 2024 at 3:45 PM Chao Sun <sunc...@apache.org> wrote:
> > > > > > >
> > > > > > > > Thanks all for the positive support!
> > > > > > > >
> > > > > > > > Andy, we plan to name the project Comet (BTW if you have better
> > > > > > > > suggestions please let us know). Could you help to create a repo
> > > > > > > > named arrow-datafusion-comet or arrow-comet? We'll clean up our
> > > > > > > > internal repo and prepare for the donation in the next few days.
> > > > > > > > Thanks for the help!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Chao
> > > > > > > >
> > > > > > > > On Fri, Jan 12, 2024 at 7:09 AM Andy Grove <andygrov...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > I think the next step here would be to create a new repo so
> > > > > > > > > that Chao can create a PR for the contribution, and then we
> > > > > > > > > can proceed to a vote.
> > > > > > > > >
> > > > > > > > > Chao - do you have a proposal for the name of the project?
> > > > > > > > > Given that this is being donated to Apache Arrow, the repo
> > > > > > > > > name will start with "arrow-". Also, given that this is more
> > > > > > > > > of a DataFusion sub-project, I think it would make sense to
> > > > > > > > > prefix the repo name with "arrow-datafusion-" and then rename
> > > > > > > > > to "datafusion-" once we move the DataFusion projects to the
> > > > > > > > > new top-level project.
> > > > > > > > >
> > > > > > > > > If the vote passes, we must complete the IP clearance process
> > > > > > > > > before the PR is accepted [1].
> > > > > > > > >
> > > > > > > > > [1] https://incubator.apache.org/ip-clearance/
> > > > > > > > >
> > > > > > > > > On Fri, Jan 12, 2024 at 12:36 AM Albert <zinki...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Like Andrew Lamb mentioned, blaze-rs has similar goals, I'd
> > > > > > > > > > really be interested to know some comparisons when the
> > > > > > > > > > donations are made.
> > > > > > > > > > All in all, I look forward to the new native project for
> > > > > > > > > > spark acceleration.
> > > > > > > > > >
> > > > > > > > > > On Thu, Jan 11, 2024 at 9:50 PM Andrew Lamb <al...@influxdata.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > I am very supportive of this donation. I know of at least
> > > > > > > > > > > one other DataFusion-based project, blaze-rs[1], which has
> > > > > > > > > > > the same design goal and bringing this project into the
> > > > > > > > > > > ASF may help consolidate these efforts
> > > > > > > > > > >
> > > > > > > > > > > As Andy said, I believe it was very valuable to have a
> > > > > > > > > > > major consumer project (e.g. DataFusion) to help drive the
> > > > > > > > > > > definition and implementation of arrow-rs implementation.
> > > > > > > > > > > We never achieved the same synergy with Ballista and
> > > > > > > > > > > DataFusion but I think it is more likely with a more
> > > > > > > > > > > actively maintained Spark accelerator.
> > > > > > > > > > >
> > > > > > > > > > > I am not sure it affects this discussion, but the Gluten
> > > > > > > > > > > project, based on Velox, was accepted yesterday[2] into
> > > > > > > > > > > the Apache Incubator[2]. While the functionality may be
> > > > > > > > > > > similar, the technology (Rust vs C/C++) and the
> > > > > > > > > > > communities are different so having both in the same (big)
> > > > > > > > > > > tent of the ASF doesn't seem concerning to me.
> > > > > > > > > > >
> > > > > > > > > > > Also, as Chao says, I think this new sub project would
> > > > > > > > > > > naturally move to a new DataFusion top level project when
> > > > > > > > > > > we get there (we plan a proposed resolution April ASF
> > > > > > > > > > > board meeting)
> > > > > > > > > > >
> > > > > > > > > > > Looking forward to seeing more!
> > > > > > > > > > > Andrew
> > > > > > > > > > >
> > > > > > > > > > > [1]: https://github.com/blaze-init/blaze
> > > > > > > > > > > [2]: https://lists.apache.org/thread/6lrozds10jn9gknj9rf74lqbh7j55pq6
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jan 10, 2024 at 5:10 PM Andy Grove <andygrov...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Chao,
> > > > > > > > > > > >
> > > > > > > > > > > > This sounds like a really interesting project. I am
> > > > > > > > > > > > interested in seeing how it compares to Spark RAPIDS
> > > > > > > > > > > > (the project that I work on at NVIDIA) and Intel's
> > > > > > > > > > > > Gluten project (that works with Velox).
> > > > > > > > > > > >
> > > > > > > > > > > > I can see the following benefits of having this project
> > > > > > > > > > > > being under Apache Arrow governance:
> > > > > > > > > > > >
> > > > > > > > > > > > - Assuming that this is a drop-in replacement that
> > > > > > > > > > > > doesn't require users to change their code (as I imagine
> > > > > > > > > > > > is the case), then it could lead to greater adoption of
> > > > > > > > > > > > DataFusion, especially for more demanding use cases
> > > > > > > > > > > > where processing on a single node is not possible.
> > > > > > > > > > > > - Given that it has a deep integration with the Rust
> > > > > > > > > > > > implementation of Arrow as well as DataFusion, and given
> > > > > > > > > > > > the overlap of committers between these projects, having
> > > > > > > > > > > > them under the same governance and communication
> > > > > > > > > > > > channels will generally be more efficient than if this
> > > > > > > > > > > > project is separate.
> > > > > > > > > > > > - Hopefully this leads to more upstream contributions to
> > > > > > > > > > > > DataFusion, perhaps even allowing other projects such as
> > > > > > > > > > > > Ballista to benefit from Spark-compatible operators and
> > > > > > > > > > > > expressions in the future.
> > > > > > > > > > > > - Having another project that uses DataFusion as a
> > > > > > > > > > > > dependency could help with stabilizing the public APIs
> > > > > > > > > > > > and generally driving more innovation.
> > > > > > > > > > > >
> > > > > > > > > > > > Given these points, I would be supportive of a donation.
> > > > > > > > > > > > I see it as being similar to the Ballista project, which
> > > > > > > > > > > > is already part of Arrow (and we plan to move along with
> > > > > > > > > > > > DataFusion once it becomes a top-level project).
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > Andy.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jan 10, 2024 at 2:28 PM Chao Sun <sunc...@apache.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > >
> > > > > > > > > > > > > We have been working on a native execution engine for
> > > > > > > > > > > > > Apache Spark that is heavily based on DataFusion and
> > > > > > > > > > > > > Arrow. Our goal is to accelerate Spark query execution
> > > > > > > > > > > > > via delegating Spark's physical plan execution to
> > > > > > > > > > > > > DataFusion's highly modular execution framework, while
> > > > > > > > > > > > > still maintaining the same semantics to Spark users
> > > > > > > > > > > > > (i.e., no Spark behavior change from the end users'
> > > > > > > > > > > > > point of view). Several of us are Spark and/or Arrow
> > > > > > > > > > > > > committers. At the moment, the project is under active
> > > > > > > > > > > > > development and not yet feature complete. However,
> > > > > > > > > > > > > some of the existing functionalities are relatively
> > > > > > > > > > > > > mature and have been put in production for a while
> > > > > > > > > > > > > now.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Given the current momentum towards accelerating Spark
> > > > > > > > > > > > > through native vectorized execution, we believe open
> > > > > > > > > > > > > sourcing this work will benefit other Spark users too.
> > > > > > > > > > > > > In addition, we think the project itself can also
> > > > > > > > > > > > > leverage the vibrant and strong community behind Arrow
> > > > > > > > > > > > > and DataFusion, and grow faster.
> > > > > > > > > > > > > Because of this, we are exploring the possibility of
> > > > > > > > > > > > > contributing this project to the Apache Software
> > > > > > > > > > > > > Foundation (ASF) under the Apache Arrow project
> > > > > > > > > > > > > umbrella.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We'd very much like to hear your opinion on this.
> > > > > > > > > > > > > Thanks.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Chao
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > ~~~~~~~~~~~~~~~
> > > > > > > > > > no mistakes
> > > > > > > > > > ~~~~~~~~~~~~~~~~~~