Hello,

As we have discussed [1][2], I would like to call a vote on the proposal
to create a new Apache Top Level Project for DataFusion. The text of the
proposed resolution and background document is copied below.

If the community is in favor of this, we plan to submit the resolution
to the ASF board for approval with the next Arrow report (for the
April 2024 board meeting).

The vote will be open for at least 7 days.

[ ] +1 Accept this Proposal
[ ] +0
[ ] -1 Do not accept this proposal because...

Andrew

[1] https://lists.apache.org/thread/c150t1s1x0kcb3r03cjyx31kqs5oc341
[2] https://github.com/apache/arrow-datafusion/discussions/6475

---------- Proposed Resolution ---------

Resolution to Create the Apache DataFusion Project from the Apache
Arrow DataFusion Sub Project

=============================================================

X. Establish the Apache DataFusion Project

WHEREAS, the Board of Directors deems it to be in the best
interests of the Foundation and consistent with the
Foundation's purpose to establish a Project Management
Committee charged with the creation and maintenance of
open-source software related to an extensible query engine
for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management
Committee (PMC), to be known as the "Apache DataFusion Project",
be and hereby is established pursuant to Bylaws of the
Foundation; and be it further

RESOLVED, that the Apache DataFusion Project be and hereby is
responsible for the creation and maintenance of software
related to an extensible query engine; and be it further

RESOLVED, that the office of "Vice President, Apache DataFusion" be
and hereby is created, the person holding such office to
serve at the direction of the Board of Directors as the chair
of the Apache DataFusion Project, and to have primary responsibility
for management of the projects within the scope of
responsibility of the Apache DataFusion Project; and be it further

RESOLVED, that the persons listed immediately below be and
hereby are appointed to serve as the initial members of the
Apache DataFusion Project:

* Andy Grove (agr...@apache.org)
* Andrew Lamb (al...@apache.org)
* Daniël Heres (dhe...@apache.org)
* Jie Wen (jake...@apache.org)
* Kun Liu (liu...@apache.org)
* Liang-Chi Hsieh (vii...@apache.org)
* Qingping Hou (ho...@apache.org)
* Wes McKinney (w...@apache.org)
* Will Jones (wjones...@apache.org)

RESOLVED, that the Apache DataFusion Project be and hereby
is tasked with the migration and rationalization of the Apache
Arrow DataFusion sub-project; and be it further

RESOLVED, that all responsibilities pertaining to the Apache
Arrow DataFusion sub-project encumbered upon the
Apache Arrow Project are hereafter discharged.

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrew Lamb
be appointed to the office of Vice President, Apache DataFusion, to
serve in accordance with and subject to the direction of the
Board of Directors and the Bylaws of the Foundation until
death, resignation, retirement, removal or disqualification,
or until a successor is appointed.
=============================================================


-------


Summary:

We propose creating a new top level project, Apache DataFusion, from
an existing sub project of Apache Arrow to facilitate additional
community and project growth.

Abstract

Apache Arrow DataFusion[1] is a very fast, extensible query engine
for building high-quality data-centric systems in Rust, using the
Apache Arrow in-memory format. DataFusion offers SQL and DataFrame
APIs, excellent performance, built-in support for CSV, Parquet, JSON,
and Avro, extensive customization, and a great community.

[1] https://arrow.apache.org/datafusion/


Proposal

We propose creating a new top level ASF project, Apache DataFusion,
governed initially by a subset of the Apache Arrow project’s PMC and
committers. The project’s code is in five existing git repositories,
currently governed by Apache Arrow, which would transfer to the new top
level project.

Background

When DataFusion was initially donated to the Arrow project, it did not
have a strong enough community to stand on its own. It has since grown
significantly, benefiting immensely from being part of Arrow and from
the nurturing of the Apache Way. It now has a community that is strong
enough to stand on its own and that would benefit from focused
governance attention.

The community has discussed this idea publicly for more than 6 months
(https://github.com/apache/arrow-datafusion/discussions/6475) and
briefly on the Arrow PMC mailing list
(https://lists.apache.org/thread/thv2jdm6640l6gm88hy8jhk5prjww0cs). As
of this writing, both discussions had exclusively positive reactions.

Several current members of the Arrow PMC are active contributors to
DataFusion, understand and believe deeply in the Apache Way, and play
active governance roles in the Arrow project as PMC members and PMC
chairs, guiding the community and releasing software versions.
With this existing governance experience and structure, the new top
level project will be able to function well immediately and
independently.

Overview of DataFusion

Current Status

Meritocracy

DataFusion has been developed as part of Apache Arrow and thus has
been operating as a meritocracy. Many of the developers of DataFusion
are Arrow PMC members or committers. The DataFusion project plans to
continue adding new PMC members and committers as the project matures
and grows.

Community

The DataFusion development team seeks to foster the development and
user communities. We hope that becoming a separate project will help
both Arrow and DataFusion communities by being more focused. Focused
governance will make it easier to grow the community of committers and
PMC members and make the organization clearer to others.

Alignment

The ASF is a natural host for DataFusion given that it is already the
home of Arrow, Parquet, and other related distributed, storage, and
query execution systems.

Project Leadership

Proposed Initial PMC

We propose the following people as the initial DataFusion PMC members.
This is a subset of the existing Arrow PMC members who contribute to
DataFusion: https://people.apache.org/phonebook.html?unix=arrow

Andy Grove (agrove): Arrow PMC Chair
Andrew Lamb (alamb): Arrow PMC, past Arrow PMC Chair
Daniël Heres (dheres): Arrow PMC
Jie Wen (jakevin): Arrow PMC, Doris Committer
Kun Liu (liukun): Arrow PMC, IoTDB PMC, TSFile PMC
Liang-Chi Hsieh (viirya): Arrow PMC, Spark PMC
Qingping Hou (houqp): Arrow PMC
Wes McKinney (wesm): Arrow PMC, ASF Member
Will Jones (wjones127): Arrow PMC

We’d like to propose Andrew Lamb as the initial Chair (and thus ASF
VP) for the DataFusion project.

Affiliations

Andy Grove (agrove): NVIDIA
Andrew Lamb (alamb): InfluxData
Daniël Heres (dheres): Coralogix
Jie Wen (jakevin): SelectDB
Kun Liu (liukun): eBay
Liang-Chi Hsieh (viirya): Apple
Qingping Hou (houqp): Scribd
Wes McKinney (wesm): Posit
Will Jones (wjones127): LanceDB

Proposed Initial Committers

In addition to the PMC, we propose the following people as the initial
DataFusion committers. This is a subset of the existing Arrow
committers who contribute to DataFusion:
https://people.apache.org/phonebook.html?unix=arrow

akurmustafa Mustafa Akur (Synnada)
avantgardner Brent Gardner (Coralogix)
comphead Oleks V. (Unaffiliated)
jayzhan Jay Zhan (Unaffiliated)
jeffreyvo Jeffrey Vo (Unaffiliated)
jiayuliu Liu Jiayu (Airbnb)
mete Metehan Yildirim (Synnada)
mingmwang Wang Mingming (eBay)
mneumann Marco Neumann (InfluxData)
nju_yaho Zhong Yanghong (eBay)
ozankabak Mehmet Ozan Kabak (Synnada)
paddyhoran Paddy Horan (Assured Allies)
rdettai Rémi Dettai (Cloudfuse)
sunchao Chao Sun (Apple)
thinkharderdev Daniel Harris (Coralogix)
tustvold Raphael Taylor-Davies (InfluxData)
wayne Ruihang Xia (Greptime)
xudong963 Xudong Wang (ByteDance)
yjshen Yijie Shen (Space and Time)
yangjiang Yang Jiang (eBay)


Risk Assessments

Naming / Trademarks

As a sub-project of Arrow, the DataFusion name has been used for over
4 years without any known issues. A podling name search did not turn
up any concerns and was approved:
https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-219

Legal / IP Clearance

All DataFusion code has either been donated to the Arrow project with
appropriate IP clearance or has been developed directly under ASF
processes and procedures. Thus, creating a new top level project poses
no new Legal or IP risks.

Code Extraction

The relevant code is already in 5 separate repositories:
https://github.com/apache/arrow-datafusion/
https://github.com/apache/arrow-datafusion-python
https://github.com/apache/arrow-ballista
https://github.com/apache/arrow-ballista-python
https://github.com/apache/arrow-datafusion-comet

We foresee no issues with code extraction and propose that these
repositories be renamed to reflect the new top level project.

Note:  https://github.com/apache/arrow-rs, the Rust implementation of
Arrow, would remain part of the Arrow project.

Orphaned Products

DataFusion is known to be used in many open source and commercial
projects
(https://arrow.apache.org/datafusion/user-guide/introduction.html#known-users),
has had multiple commits daily for several years, and its adoption and
number of contributors appear to be growing. We do not foresee the
project being orphaned in the next several years.

Inexperience with Open Source

The proposed PMC has extensive experience with Apache Arrow and other
Apache projects, and includes PMC members, PMC chairs and an ASF
Member. The DataFusion PMC and more experienced committers will
continue to coach new community members who may be less familiar with
the Apache Way.

Homogeneous Developers

The 9 proposed PMC members are from 9 different employers and the
proposed committers are similarly distributed across affiliations. No
specific entity employs more than 3 total proposed developers.

Reliance on Salaried Developers

A substantial amount of work on DataFusion has been done by salaried
developers, but the project also has a long tradition of attracting
contributions from students and hobbyists, and we plan no changes in
contribution structure.

Relationships with Other Apache Products

DataFusion will obviously have a strong relationship with the Arrow
project given the overlap in people. We don’t foresee close
collaboration with other projects at this time.

Cryptography

DataFusion does not directly support encryption and there are no
near-term plans to add support for encryption. Users who need this
functionality can use the extension APIs.

Required Resources

Mailing Lists

- priv...@datafusion.apache.org for private PMC discussions (with
moderated subscriptions)
- d...@datafusion.apache.org
- comm...@datafusion.apache.org
- u...@datafusion.apache.org

Version Control

We propose to continue to use git for source control and GitHub for
hosting and testing resources.

We also need to rename the GitHub repositories to reflect the new top
level names:

https://github.com/apache/arrow-datafusion/ → apache/datafusion
https://github.com/apache/arrow-datafusion-python → apache/datafusion-python
https://github.com/apache/arrow-ballista → apache/datafusion-ballista
https://github.com/apache/arrow-ballista-python → apache/datafusion-ballista-python
https://github.com/apache/arrow-datafusion-comet → apache/datafusion-comet



Issue Tracking

DataFusion would continue to use GitHub for its issue tracking and
communications.

Other Resources

The existing repositories already make use of existing Apache
infrastructure, and we expect no change in the initial resource usage.
As the project continues to grow, we expect continued infrastructure
demand growth.


FAQ: Has a sub project been promoted to a top level project before?

Yes, and it appears to happen commonly. The Arrow project itself was
created as a top level project from work that started in Apache Drill,
and there are many sub projects of Hadoop that spun out as their own
top level projects such as Mahout, Avro and HBase:
https://news.apache.org/foundation/entry/the_apache_software_foundation_announces4



Related material:
Name search request / research for DataFusion:
https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-219
Discussion about this proposal on the arrow mailing list:
https://lists.apache.org/thread/c150t1s1x0kcb3r03cjyx31kqs5oc341
Discussion about which repositories to transfer, on the arrow mailing list:
https://lists.apache.org/thread/ob3n0d9ky0bgrryl3xn39w9k566bq00q
Discussion about initial PMC on the arrow mailing list:
https://lists.apache.org/thread/pymrzcdw4qdptvby85f69rg3pcckl15b
Discussion in GitHub about creating a new DataFusion top level
project: https://github.com/apache/arrow-datafusion/discussions/6475
Discussion about graduating on incubator list:
https://lists.apache.org/thread/r4n73pmms1lv0jbohyx1o1z13d615t99
Original Proposal for the Arrow project:
https://lists.apache.org/thread/x2qzdwglm8pkqp9gv03bbgw17khl7pq3
