+1, It makes sense!

- Kousuke

(2014/11/05 17:31), Matei Zaharia wrote:
Hi all,

I wanted to share a discussion we've been having on the PMC list, as well as 
call for an official vote on it on a public list. Basically, as the Spark 
project scales up, we need to define a model to make sure there is still great 
oversight of key components (in particular internal architecture and public 
APIs), and to this end I've proposed implementing a maintainer model for some 
of these components, similar to other large projects.

As background on this, Spark has grown a lot since joining Apache. We've had 
over 80 contributors/month for the past 3 months, which I believe makes us the 
most active project in contributors/month at Apache, as well as over 500 
patches/month. The codebase has also grown significantly, with new libraries 
for SQL, ML, graphs and more.

In this kind of large project, one common way to scale development is to assign 
"maintainers" to oversee key components, where each patch to that component 
needs to get sign-off from at least one of its maintainers. Most existing large projects 
do this -- at Apache, some large ones with this model are CloudStack (the second-most 
active project overall), Subversion, and Kafka, and other examples include Linux and 
Python. This is also by and large how Spark operates today -- most components have a 
de facto maintainer.

IMO, adopting this model would have two benefits:

1) Consistent oversight of design for that component, especially regarding 
architecture and API. This process would ensure that the component's 
maintainers see all proposed changes and consider whether they fit together 
well.

2) More structure for new contributors and committers -- in particular, it 
would be easy to look up who’s responsible for each module and ask them for 
reviews, etc, rather than having patches slip between the cracks.

We'd like to start in a light-weight manner, where the model only applies 
to certain key components (e.g. scheduler, shuffle) and user-facing APIs 
(MLlib, GraphX, etc). Over time, as the project grows, we can expand it if we 
deem it useful. The specific mechanics would be as follows:

- Some components in Spark will have maintainers assigned to them, where one of 
the maintainers needs to sign off on each patch to the component.
- Each component with maintainers will have at least 2 maintainers.
- Maintainers will be assigned from the most active and knowledgeable 
committers on that component by the PMC. The PMC can vote to add / remove 
maintainers, and maintained components, through consensus.
- Maintainers are expected to be active in responding to patches for their 
components, though they do not need to be the main reviewers for them (e.g. 
they might just sign off on architecture / API). To prevent inactive 
maintainers from blocking the project, if a maintainer isn't responding in a 
reasonable time period (say 2 weeks), other committers can merge the patch, and 
the PMC will want to discuss adding another maintainer.
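
To make the sign-off and timeout rules above concrete, here is a rough sketch 
of how they might be checked. It is only an illustration under stated 
assumptions: the function name can_merge, its parameters, and the MAINTAINERS 
table are hypothetical and not part of any Spark tooling, the two sample 
entries are copied from the proposed list in this mail, and the 2-week window 
is the "say 2 weeks" figure rather than a decided value.

```python
from datetime import datetime, timedelta

# Hypothetical table: two entries copied from the proposed maintainer list.
MAINTAINERS = {
    "Job scheduler": {"Matei", "Kay", "Patrick"},
    "Block manager": {"Reynold", "Aaron"},
}

# Assumption: the proposal only says "say 2 weeks" for a reasonable period.
RESPONSE_WINDOW = timedelta(weeks=2)


def can_merge(component, signoffs, responders, opened_at, now):
    """Return True if a patch to `component` may be merged under the rules.

    signoffs   -- names of people who approved the patch
    responders -- names of people who responded to the patch in any way
    """
    maintainers = MAINTAINERS.get(component)
    if maintainers is None:
        # The component has no assigned maintainers, so the model does not apply.
        return True
    if maintainers & set(signoffs):
        # At least one of the component's maintainers has signed off.
        return True
    # Otherwise other committers may merge only if no maintainer has
    # responded at all and the response window has elapsed.
    no_response = not (maintainers & set(responders))
    return no_response and (now - opened_at >= RESPONSE_WINDOW)
```

For example, a "Job scheduler" patch approved by Kay could be merged right 
away, while one that no maintainer has responded to could be merged by other 
committers only after the window has passed.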

If you'd like to see examples for this model, check out the following projects:
- CloudStack: 
https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide 
- Subversion: https://subversion.apache.org/docs/community-guide/roles.html 

Finally, I wanted to list our current proposal for initial components and 
maintainers. It would be good to get feedback on other components we might add, but 
please note that personnel discussions (e.g. "I don't think Matei should 
maintain *that* component") should only happen on the private list. The initial 
components were chosen to include all public APIs and the main core components, and 
the maintainers were chosen from the most active contributors to those modules.

- Spark core public API: Matei, Patrick, Reynold
- Job scheduler: Matei, Kay, Patrick
- Shuffle and network: Reynold, Aaron, Matei
- Block manager: Reynold, Aaron
- YARN: Tom, Andrew Or
- Python: Josh, Matei
- MLlib: Xiangrui, Matei
- SQL: Michael, Reynold
- Streaming: TD, Matei
- GraphX: Ankur, Joey, Reynold

I'd like to formally call a [VOTE] on this model, to last 72 hours. The [VOTE] 
will end on Nov 8, 2014 at 6 PM PST.

Matei

