Hello,

As we have discussed [1][2], I would like to call a vote on the proposal to create a new Apache Top Level Project for DataFusion. The text of the proposed resolution and background document is copy/pasted below.
If the community is in favor of this, we plan to submit the resolution to the ASF board for approval with the next Arrow report (for the April 2024 board meeting).

The vote will be open for at least 7 days.

[ ] +1 Accept this Proposal
[ ] +0
[ ] -1 Do not accept this proposal because...

Andrew

[1] https://lists.apache.org/thread/c150t1s1x0kcb3r03cjyx31kqs5oc341
[2] https://github.com/apache/arrow-datafusion/discussions/6475

---------- Proposed Resolution ---------

Resolution to Create the Apache DataFusion Project from the Apache Arrow DataFusion Sub Project

=============================================================

X. Establish the Apache DataFusion Project

WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to an extensible query engine for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the "Apache DataFusion Project", be and hereby is established pursuant to Bylaws of the Foundation; and be it further

RESOLVED, that the Apache DataFusion Project be and hereby is responsible for the creation and maintenance of software related to an extensible query engine; and be it further

RESOLVED, that the office of "Vice President, Apache DataFusion" be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache DataFusion Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache DataFusion Project; and be it further

RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache DataFusion Project:

* Andy Grove (agr...@apache.org)
* Andrew Lamb (al...@apache.org)
* Daniël Heres (dhe...@apache.org)
* Jie Wen (jake...@apache.org)
* Kun Liu (liu...@apache.org)
* Liang-Chi Hsieh (vii...@apache.org)
* Qingping Hou (ho...@apache.org)
* Wes McKinney (w...@apache.org)
* Will Jones (wjones...@apache.org)

RESOLVED, that the Apache DataFusion Project be and hereby is tasked with the migration and rationalization of the Apache Arrow DataFusion sub-project; and be it further

RESOLVED, that all responsibilities pertaining to the Apache Arrow DataFusion sub-project encumbered upon the Apache Arrow Project are hereafter discharged.

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrew Lamb be appointed to the office of Vice President, Apache DataFusion, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed.

=============================================================

-------

Summary: We propose creating a new top level project, Apache DataFusion, from an existing sub project of Apache Arrow to facilitate additional community and project growth.

Abstract

Apache Arrow DataFusion[1] is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format. DataFusion offers SQL and Dataframe APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.

[1] https://arrow.apache.org/datafusion/

Proposal

We propose creating a new top level ASF project, Apache DataFusion, governed initially by a subset of the Apache Arrow project’s PMC and committers. The project’s code is in five existing git repositories, currently governed by Apache Arrow, which would transfer to the new top level project.

Background

When DataFusion was initially donated to the Arrow project, it did not have a strong enough community to stand on its own.
It has since grown significantly, benefited immensely from being part of Arrow and from the nurturing of the Apache Way, and now has a community that is strong enough to stand on its own and that would benefit from focused governance attention.

The community has discussed this idea publicly for more than 6 months (https://github.com/apache/arrow-datafusion/discussions/6475) and briefly on the Arrow PMC mailing list (https://lists.apache.org/thread/thv2jdm6640l6gm88hy8jhk5prjww0cs). As of the time of this writing, both had exclusively positive reactions.

Several current members of the Arrow PMC are active contributors to DataFusion, understand and believe deeply in the Apache Way, and play active governance roles in the Arrow project as PMC members and PMC chairs, guiding the community and releasing software versions. With this existing governance experience and structure, the new top level project will be able to function well immediately and independently.

Overview of DataFusion

Current Status

Meritocracy

DataFusion has been developed as part of Apache Arrow and thus has been operating as a meritocracy. Many of the developers of DataFusion are Arrow PMC members or committers. The DataFusion project plans to continue adding new PMC members and committers as the project matures and grows.

Community

The DataFusion development team seeks to foster the development and user communities. We hope that becoming a separate project will help both the Arrow and DataFusion communities by letting each be more focused. Focused governance will make it easier to grow the community of committers and PMC members and make the organization clearer to others.

Alignment

The ASF is a natural host for DataFusion given that it is already the home of Arrow, Parquet, and other related distributed, storage, and query execution systems.

Project Leadership

Proposed Initial PMC

We propose the following people as the initial DataFusion PMC members.
This is a subset of the existing Arrow PMC members who contribute to DataFusion (https://people.apache.org/phonebook.html?unix=arrow):

Andy Grove (agrove): Arrow PMC Chair
Andrew Lamb (alamb): Arrow PMC, past Arrow PMC Chair
Daniël Heres (dheres): Arrow PMC
Jie Wen (jakevin): Arrow PMC, Doris Committer
Kun Liu (liukun): Arrow PMC, IoTDB PMC, TSFile PMC
Liang-Chi Hsieh (viirya): Arrow PMC, Spark PMC
Qingping Hou (houqp): Arrow PMC
Wes McKinney (wesm): Arrow PMC, ASF Member
Will Jones (wjones127): Arrow PMC

We’d like to propose Andrew Lamb as the initial Chair (and thus ASF VP) for the DataFusion project.

Affiliations

Andy Grove (agrove): NVIDIA
Andrew Lamb (alamb): InfluxData
Daniël Heres (dheres): Coralogix
Jie Wen (jakevin): SelectDB
Kun Liu (liukun): eBay
Liang-Chi Hsieh (viirya): Apple
Qingping Hou (houqp): Scribd
Wes McKinney (wesm): Posit
Will Jones (wjones127): LanceDB

Proposed Initial Committers

In addition to the PMC, we propose the following people as the initial DataFusion committers. This is a subset of the existing Arrow committers who contribute to DataFusion (https://people.apache.org/phonebook.html?unix=arrow):

akurmustafa Mustafa Akur (Synnada)
avantgardner Brent Gardner (Coralogix)
comphead Oleks V. (Unaffiliated)
jayzhan Jay Zhan (Unaffiliated)
jeffreyvo Jeffrey Vo (Unaffiliated)
jiayuliu Liu Jiayu (Airbnb)
mete Metehan Yildirim (Synnada)
mingmwang Wang Mingming (eBay)
mneumann Marco Neumann (InfluxData)
nju_yaho Zhong Yanghong (eBay)
ozankabak Mehmet Ozan Kabak (Synnada)
paddyhoran Paddy Horan (Assured Allies)
rdettai Rémi Dettai (Cloudfuse)
sunchao Chao Sun (Apple)
thinkharderdev Daniel Harris (Coralogix)
tustvold Raphael Taylor-Davies (InfluxData)
wayne Ruihang Xia (Greptime)
xudong963 Xudong Wang (ByteDance)
yjshen Yijie Shen (Space and Time)
yangjiang Yang Jiang (eBay)

Risk Assessments

Naming / Trademarks

As a sub-project of Arrow, the DataFusion name has been used for over 4 years without any known issues.
A podling name search did not turn up any concerns and was approved: https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-219

Legal / IP Clearance

All DataFusion code has either been donated to the Arrow project with appropriate IP clearance or has been developed directly under ASF processes and procedures. Thus creating a new top level project poses no new Legal or IP risks.

Code Extraction

The relevant code is already in 5 separate repositories:

https://github.com/apache/arrow-datafusion/
https://github.com/apache/arrow-datafusion-python
https://github.com/apache/arrow-ballista
https://github.com/apache/arrow-ballista-python
https://github.com/apache/arrow-datafusion-comet

We foresee no issues with code extraction and propose that these repositories be renamed to reflect the new top level project.

Note: https://github.com/apache/arrow-rs, the Rust implementation of Arrow, would remain part of the Arrow project.

Orphaned Products

DataFusion is known to be used in many open source and commercial projects (https://arrow.apache.org/datafusion/user-guide/introduction.html#known-users), has had multiple commits daily for several years, and its adoption and number of contributors appear to be growing. We do not foresee the project being orphaned in the next several years.

Inexperience with Open Source

The proposed PMC has extensive experience with Apache Arrow and other Apache projects, and includes PMC members, PMC chairs and an ASF Member. The DataFusion PMC and more experienced committers will continue to coach new community members who may be less familiar with the Apache Way.

Homogeneous Developers

The 9 proposed PMC members are from 9 different employers and the proposed committers are similarly distributed across affiliations. No specific entity employs more than 3 total proposed developers.

Reliance on Salaried Developers

A substantial amount of work on DataFusion has been done by salaried developers, but the project also has a long tradition of attracting contributions from students and hobbyists, and we plan no changes in contribution structure.

Relationships with Other Apache Products

DataFusion will obviously have a strong relationship with the Arrow project given the overlap in people. We don’t foresee close collaboration with other projects at this time.

Cryptography

DataFusion does not directly support encryption and there are no near-term plans to add support for encryption. Users who need this functionality can use the extension APIs.

Required Resources

Mailing Lists

- priv...@datafusion.apache.org for private PMC discussions (with moderated subscriptions)
- d...@datafusion.apache.org
- comm...@datafusion.apache.org
- u...@datafusion.apache.org

Version Control

We propose to continue to use git for source control and github for hosting and testing resources. We also need to rename the github repositories to reflect the new top level names:

https://github.com/apache/arrow-datafusion/ → apache/datafusion
https://github.com/apache/arrow-datafusion-python → apache/datafusion-python
https://github.com/apache/arrow-ballista → apache/datafusion-ballista
https://github.com/apache/arrow-ballista-python → apache/datafusion-ballista-python
https://github.com/apache/arrow-datafusion-comet → apache/datafusion-comet

Issue Tracking

DataFusion would continue to use github for its issue tracking and communications.

Other Resources

The existing repositories already make use of existing Apache infrastructure, and we expect no change in the initial resource usage. As the project continues to grow, we expect its infrastructure demand to grow as well.

FAQ:

Has a sub project been promoted to a top level project before?

Yes, and it appears to happen commonly.
The Arrow project itself was created as a top level project from work that started in Apache Drill, and there are many sub projects of Hadoop that spun out as their own top level projects such as Mahout, Avro and HBase: https://news.apache.org/foundation/entry/the_apache_software_foundation_announces4

Related material:

Name search request / research for DataFusion: https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-219
Discussion about this proposal on the arrow mailing list: https://lists.apache.org/thread/c150t1s1x0kcb3r03cjyx31kqs5oc341
Discussion about which repositories on the arrow mailing list: https://lists.apache.org/thread/ob3n0d9ky0bgrryl3xn39w9k566bq00q
Discussion about initial PMC on the arrow mailing list: https://lists.apache.org/thread/pymrzcdw4qdptvby85f69rg3pcckl15b
Discussion in github about creating a new DataFusion top level project: https://github.com/apache/arrow-datafusion/discussions/6475
Discussion about graduating on incubator list: https://lists.apache.org/thread/r4n73pmms1lv0jbohyx1o1z13d615t99
Original Proposal for the Arrow project: https://lists.apache.org/thread/x2qzdwglm8pkqp9gv03bbgw17khl7pq3