Hi Folks,
Thanks for participating in the discussion. I'd like to call a VOTE for
acceptance of Apache Tez into the Incubator. I'll let the vote run through
this weekend (until Sun 2/24, 6pm PST).
[ ] +1 Accept Apache Tez into the Incubator
[ ] +0 Don't care.
[ ] -1 Don't accept Apache Tez into the Incubator because...
Full proposal is pasted at the bottom of this email, and the
corresponding wiki is http://wiki.apache.org/incubator/TezProposal.
Only VOTEs from Incubator PMC members are binding, but all are welcome to
express their thoughts.
Here's my +1 (binding).
thanks,
Arun
PS: From the initial discussion, the only changes are that I've added one
new mentor and 2 new committers. None of the new additions come from the
majority employer, and we will continue to strive to diversify further
during the incubation. Thanks.
----
= Tez =
== Abstract ==
Tez is an effort to develop a generic application framework which can be
used to process arbitrarily complex data-processing tasks, and also a
re-usable set of data-processing primitives which can be used by other
projects.
== Proposal ==
Tez is a proposal to develop a generic application which can be used to
process complex data-processing task DAGs and which runs natively on Apache
Hadoop YARN. YARN is a generic resource-management system on which
applications like MapReduce already run. MapReduce is a specific, and
constrained, DAG, which is not optimal for several frameworks like Apache
Hive and Apache Pig. Furthermore, we propose to develop a re-usable set of
libraries of data-processing primitives such as sorting, merging,
data-shuffling, intermediate data management etc., which are necessary for
Tez and which we envision can be used directly by other projects.
== Background ==
Apache Hadoop MapReduce has emerged as the assembly language on which other
frameworks like Apache Pig and Apache Hive have been built. However, it is
well accepted that MapReduce produces very constrained task DAGs for each
job, which results in Apache Pig and Apache Hive requiring multiple
MapReduce jobs for several queries. By providing a more expressive DAG of
tasks for a job, Tez attempts to provide significantly enhanced
data-processing capabilities for projects like Apache Pig, Apache Hive,
Cascading etc.
== Rationale ==
Tez fills an important gap in the Apache Hadoop ecosystem by allowing for
more expressive task DAGs for data-processing applications such as Apache
Pig, Apache Hive, Cascading etc.
With the emergence of Apache Hadoop YARN, there is a strong need for a
common DAG application which can then be shared by Apache Pig, Apache Hive,
Cascading etc.
== Initial Goals ==
The initial goals for this project are to specify the detailed requirements
and architecture, and then develop the initial implementation including the
DAG ApplicationMaster to run natively inside Apache Hadoop YARN.
== Current Status ==
Significant work has been completed to identify the initial requirements
and define the overall system architecture. There is a patch available in
the internal Hortonworks git repository which can act as the initial seed.
=== Meritocracy ===
We plan to invest in supporting a meritocracy. We will discuss the
requirements in an open forum. Several companies have already expressed
interest in this project, and we intend to invite additional developers to
participate.
We will encourage and monitor community participation so that privileges
can be extended to those who contribute.
=== Community ===
The need for a generic DAG application for data processing in open source
is tremendous, so there is a potential for a very large community. We
believe that Tez's extensible architecture will further encourage community
participation.
Also, related Apache projects (e.g., Pig, Hive) have very large and active
communities, and we expect that over time Tez will also attract a large
community.
=== Core Developers ===
The developers on the initial committers list include people very
experienced in the Apache Hadoop ecosystem:
* Alan Gates <gates at apache dot org>
* Arun C Murthy <acmurthy at apache dot org>
* Ashutosh Chauhan <hashutosh at apache dot org>
* Bikas Saha <bikas at apache dot org>
* Chris Douglas <cdouglas at apache dot org>
* Daryn Sharp <daryn at apache dot org>
* Devaraj Das <ddas at apache dot org>
* Gopal Vijayaraghavan <gopal at hortonworks dot com>
* Gunther Hagleitner <ghagleitner at hortonworks dot com>
* Hitesh Shah <hitesh at apache dot org>
* Jason Lowe <jlowe at apache dot org>
* Jean Xu <jeanxu at facebook dot com>
* Jitendra Pandey <jitendra at apache dot org>
* Julien Le Dem <julien at apache dot org>
* Kevin Wilfong <kevinwilfong at apache dot org>
* Mike Liddell <mike dot lidell at microsoft dot com>
* Namit Jain <namit at apache dot org>
* Nathan Roberts <nroberts at yahoo dash inc dot com>
* Owen O'Malley <omalley at apache dot org>
* Robert Evans <bobby at apache dot org>
* Siddharth Seth <sseth at apache dot org>
* Tom White <tomwhite at apache dot org>
* Thomas Graves <tgraves at apache dot org>
* Vikram Dixit <vikram at apache dot org>
* Vinod Kumar Vavilapalli <vinodkv at apache dot org>
* William Graham <billgraham at apache dot org>
We realize that though we have significant employer diversity already,
additional diversity is always better, and we will work
aggressively to recruit developers from additional companies.
=== Alignment ===
The initial committers strongly believe that a standard task DAG
application on Apache Hadoop YARN will gain broader adoption as an open
source, community-driven project, where the community can contribute not
only to the core components, but also to a growing collection of
applications which will be built on top of Tez. Our hope is that the Apache
Hive, Apache Pig, Cascading and other communities will find tremendous
value in Tez and will adopt it en masse.
== Known Risks ==
=== Orphaned Products ===
The contributors are leading users and vendors in the Apache Hadoop
ecosystem, with significant open source experience, so the risk of being
orphaned is relatively low. The project could be at risk if vendors decided
to change their strategies in the market. In such an event, the current
committers plan to continue working on the project on their own time,
though the progress will likely be slower. We plan to mitigate this risk by
recruiting additional committers.
=== Inexperience with Open Source ===
The initial committers include veteran Apache members (Committers, PMC
members and Apache Members) and other developers who have varying degrees
of experience with open source projects. All have been involved with source
code that has been released under an open source license, and several also
have experience developing code with an open source development process.
=== Homogeneous Developers ===
The initial committers are employed by a number of companies, including
Cloudera, Facebook, Hortonworks, Microsoft, Twitter and Yahoo. Although we
already have significant diversity, we are committed to recruiting
additional committers from other companies based on their contributions to
the project.
=== Reliance on Salaried Developers ===
It is expected that Tez development will occur on both salaried time and
volunteer time, after hours. The majority of initial committers are paid by
their employer to contribute to this project. However, they are all
passionate about the project, and we are confident that the project will
continue even if no salaried developers contribute to the project. We are
committed to recruiting additional committers including non-salaried
developers.
=== Relationships with Other Apache Products ===
As mentioned in the Alignment section, Tez is closely integrated with
Hadoop, Hive and Pig in numerous ways. We look forward to collaborating
with those communities, as well as other Apache communities.
=== An Excessive Fascination with the Apache Brand ===
Tez solves a real need for generic task DAG management in the Apache Hadoop
ecosystem, something which has been addressed in a very ad hoc manner so
far by multiple Apache projects. Our rationale for developing Tez as an
Apache project is detailed in the Rationale section. We believe that the
Apache brand and community process will help us attract more contributors
to this project, and help establish ubiquitous APIs.
== Documentation ==
http://wiki.apache.org/incubator/TezProposal
== Initial Source ==
Available as a patch.
== Cryptography ==
Tez will eventually support encryption on the wire. This is not one of the
initial goals, and we do not expect Tez to be a controlled export item due
to the use of encryption.
== Required Resources ==
=== Mailing List ===
* tez-private
* tez-dev
* tez-user
=== Subversion Directory ===
Git is the preferred source control system: git://git.apache.org/tez
=== Issue Tracking ===
JIRA Tez (TEZ)
== Initial Committers ==
* Alan Gates <gates at apache dot org>
* Arun C Murthy <acmurthy at apache dot org>
* Ashutosh Chauhan <hashutosh at apache dot org>
* Bikas Saha <bikas at apache dot org>
* Chris Douglas <cdouglas at apache dot org>
* Daryn Sharp <daryn at apache dot org>
* Devaraj Das <ddas at apache dot org>
* Gopal Vijayaraghavan <gopal at hortonworks dot com>
* Gunther Hagleitner <ghagleitner at hortonworks dot com>
* Hitesh Shah <hitesh at apache dot org>
* Jason Lowe <jlowe at apache dot org>
* Jean Xu <jeanxu at facebook dot com>
* Jitendra Pandey <jitendra at apache dot org>
* Julien Le Dem <julien at apache dot org>
* Kevin Wilfong <kevinwilfong at apache dot org>
* Mike Liddell <mike dot lidell at microsoft dot com>
* Namit Jain <namit at apache dot org>
* Nathan Roberts <nroberts at yahoo dash inc dot com>
* Owen O'Malley <omalley at apache dot org>
* Robert Evans <bobby at apache dot org>
* Siddharth Seth <sseth at apache dot org>
* Tom White <tomwhite at apache dot org>
* Thomas Graves <tgraves at apache dot org>
* Vikram Dixit <vikram at apache dot org>
* Vinod Kumar Vavilapalli <vinodkv at apache dot org>
* William Graham <billgraham at apache dot org>
== Affiliations ==
The initial committers are employees of Cloudera, Facebook, Hortonworks,
Microsoft, Twitter and Yahoo Inc.
* Alan Gates - Hortonworks
* Arun C Murthy - Hortonworks
* Ashutosh Chauhan - Hortonworks
* Bikas Saha - Hortonworks
* Chris Douglas - Microsoft
* Daryn Sharp - Yahoo
* Devaraj Das - Hortonworks
* Gopal Vijayaraghavan - Hortonworks
* Gunther Hagleitner - Hortonworks
* Hitesh Shah - Hortonworks
* Jason Lowe - Yahoo
* Jean Xu - Facebook
* Jitendra Pandey - Hortonworks
* Julien Le Dem - Twitter
* Kevin Wilfong - Facebook
* Mike Liddell - Microsoft
* Namit Jain - Facebook
* Nathan Roberts - Yahoo
* Owen O'Malley - Hortonworks
* Robert Evans - Yahoo
* Siddharth Seth - Hortonworks
* Tom White - Cloudera
* Thomas Graves - Yahoo
* Vikram Dixit - Hortonworks
* Vinod Kumar Vavilapalli - Hortonworks
* William Graham - Twitter
The nominated mentors are employees of Hortonworks, LinkedIn,
NASA JPL and Microsoft.
* Alan Gates - Hortonworks
* Arun C Murthy - Hortonworks
* Chris Douglas - Microsoft
* Chris Mattmann - NASA JPL
* Jakob Homan - LinkedIn
* Owen O'Malley - Hortonworks
== Sponsors ==
=== Champion ===
Arun C Murthy <acmurthy at apache dot org>
=== Nominated Mentors ===
* Alan Gates <gates at apache dot org> - Architect at Hortonworks. Committer for Pig.
* Arun C Murthy <acmurthy at apache dot org> - Architect at Hortonworks. Committer for Hadoop.
* Chris Douglas <cdouglas at apache dot org> - Sr. Research Engineer at Microsoft. Committer for Hadoop.
* Chris Mattmann <mattmann at apache dot org> - Sr. Computer Scientist, NASA JPL. Committer for Nutch, OODT and Tika.
* Jakob Homan <jghoman at apache dot org> - Sr. Software Engineer, LinkedIn. Committer for Hadoop, Kafka, Giraph.
* Owen O'Malley <omalley at apache dot org> - Architect at Hortonworks. Committer for Hadoop, Ambari.
=== Sponsoring Entity ===
Incubator