Re: [VOTE] accept Pig into Incubator

Brian McCallister Tue, 25 Sep 2007 11:32:22 -0700

+1

-Brian

On Sep 25, 2007, at 10:20 AM, Doug Cutting wrote:

I would like to call the Incubator PMC to vote to incubate the proposed Pig project. Discussion on this list evidenced broad interest in this project, which bodes well for its ability to build a diverse developer community.
http://wiki.apache.org/incubator/PigProposal

+1

Doug

-----------------------------------------------------------

= Proposal for Pig Project =

== Abstract ==

Pig is a platform for analyzing large data sets.

== Proposal ==
The Pig project consists of high-level languages for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
 1. ''Ease of programming''. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
 2. ''Optimization opportunities''. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
 3. ''Extensibility''. Users can create their own functions to do special-purpose processing.
== Background ==
Pig started as a research project at Yahoo! in May of 2006 to combine ideas in parallel databases and distributed computing. The first internal release took place in July 2006. The first release was a simple front-end to the Hadoop Map/Reduce framework. The following releases added new features and evolved the language based on user feedback. In July 2007, Pig was taken over by a development team, and the first production version is due to be released on 9/28/07.
Since its inception, we have observed steady growth of the user community within Yahoo!. In April 2007, Pig was released under a BSD-type license. Several external parties are using this version and have expressed interest in collaborating on its development.
== Rationale ==
In an information-centric world, innovation is driven by ad-hoc analysis of large data sets. For example, search engine companies routinely deploy and refine services based on analyzing the recorded behavior of users, publishers, and advertisers. The rate of innovation depends on the efficiency with which data can be analyzed.
To analyze large data sets efficiently, one needs parallelism. The cheapest and most scalable form of parallelism is cluster computing. Unfortunately, programming for a cluster computing environment is difficult and time-consuming. Pig makes it easy to harness the power of cluster computing for ad-hoc data analysis.
While other languages exist that try to achieve the same goals, we believe that Pig provides more flexibility and gives more control to the end user.
SQL typically requires (1) importing data from a user's preferred format into a database system's internal format, (2) well-structured, normalized data with a declared schema, and (3) programs expressed in declarative SELECT-FROM-WHERE blocks. In contrast, Pig Latin facilitates (1) interoperability, i.e. data may be read/written in a format accepted by other applications such as text editors or graph generators, (2) flexibility, i.e. data may be loosely structured or have structure that is defined operationally, and (3) adoption by programmers who find procedural programming more natural than declarative programming.
Sawzall is a scripting language used at Google on top of Map-Reduce. A Sawzall program has a fairly rigid structure consisting of a filtering phase (the map step) followed by an aggregation phase (the reduce step). Furthermore, only the filtering phase can be written by the user, and only a pre-built set of aggregations is available (new ones are non-trivial to add). While Pig Latin has similar higher-level primitives like filtering and aggregation, an arbitrary number of them can be flexibly chained together in a Pig Latin program, and all primitives can use user-defined functions with equal ease. Further, Pig Latin has additional primitives such as cogrouping that allow operations such as joins (which require multiple programs in Sawzall) to be written in a single line in Pig Latin. In addition, Pig Latin is designed to be embedded into other languages, and can use functions written in other languages. Thus, in contrast to Sawzall, it directly caters to a large community of developers without having to make them learn an entirely new programming language.
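To make the idea of chaining primitives concrete, here is a minimal sketch in plain Python. It is not Pig Latin syntax and not how Pig is implemented; the helper names (`filter_rows`, `group_by`, `cogroup`) and the sample data are hypothetical, chosen only to illustrate how a filter step, a cogroup step, and a join derived from the cogroup can follow one another in a single data flow, each step accepting a user-defined function.

```python
from collections import defaultdict

# Hypothetical helpers that mimic the *shape* of a data flow with
# chained primitives. They run locally and are for illustration only.

def filter_rows(rows, predicate):
    """Keep only the rows for which the user-defined predicate is true."""
    return [row for row in rows if predicate(row)]

def group_by(rows, key):
    """Collect rows into lists that share the same user-defined key."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)

def cogroup(left, right, left_key, right_key):
    """Group two inputs by key, pairing the two groups found for each key."""
    left_groups = group_by(left, left_key)
    right_groups = group_by(right, right_key)
    keys = sorted(set(left_groups) | set(right_groups))
    return {k: (left_groups.get(k, []), right_groups.get(k, [])) for k in keys}

# Made-up sample data: (user, url, seconds spent) and (user, age).
visits = [
    ("alice", "a.com", 3),
    ("bob", "b.com", 1),
    ("alice", "c.com", 5),
    ("carol", "a.com", 2),
]
users = [("alice", 31), ("bob", 25), ("dave", 40)]

# Step 1: filter with a user-defined predicate.
long_visits = filter_rows(visits, lambda v: v[2] >= 2)

# Step 2: cogroup the filtered visits with the user table by user name.
grouped = cogroup(long_visits, users, lambda v: v[0], lambda u: u[0])

# Step 3: a join is the cross product of the two groups under each key.
joined = [
    (name, age, url)
    for name, (visit_rows, user_rows) in grouped.items()
    for (_, url, _) in visit_rows
    for (_, age) in user_rows
]

print(joined)
```

Each step consumes the output of the previous one, so more filters or groupings could be appended without restructuring the program, which is the flexibility the paragraph above contrasts with a fixed filter-then-aggregate structure.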
== Current Status ==

=== Meritocracy ===
Pig was started as a project developed by the Yahoo! research team. Recently we have added a development team that works in harmony with the research team, with both teams actively and successfully contributing to the project. We are planning to create an environment that encourages meritocracy and is consistent with the meritocracy principles of Apache. Within the team we have people actively participating in the Hadoop subproject.
=== Community ===
Pig has an active user community within Yahoo! that has been steadily growing. Pig has also attracted external users since its release under a BSD-type license. Several external parties are using the product and have expressed interest in collaborating on its development.
Also, since the current version of Pig is built on top of Hadoop, we believe that we will be able to quickly extend our community by attracting both Hadoop users and developers to the project.
=== Core Developers ===
Our contributors come from both the research and development worlds, and most have a background in database internals and large-scale distributed systems.
=== Alignment ===
Yahoo! seeks to develop Pig collaboratively with others, not to control and maintain it independently. Apache offers the best legal and social framework for such community-based software development.
Also, the current version of Pig runs on top of Hadoop's Map-Reduce infrastructure, which is part of Apache. We believe there would be a lot of synergy between the projects, both in terms of users and developers.
== Known Risks ==
=== Orphaned products ===
All current contributors are part of Yahoo!, which is a major player in the space and is committed to grid computing. Also, we expect a high degree of synergy with the Hadoop subproject.
=== Inexperience with Open Source ===
Two of the committers have extensive experience with open source and Apache. The rest are new to open source and will be guided through the process by the team members with experience.
=== Homogenous Developers ===
The current list of committers is confined to Yahoo! employees. Our plan is to recruit more committers once the project gets under way.
=== Reliance on Salaried Developers ===
Currently, all contributors are Yahoo! employees. By extending the development community, we hope to mitigate this risk.
=== Relationships with Other Apache Products ===
Pig is built on top of Hadoop, and we expect deep collaboration with the Hadoop subproject.
=== An Excessive Fascination with the Apache Brand ===
Yahoo! already has a strong brand and is not interested in Apache as a way to gain visibility. Yahoo! seeks to develop Pig collaboratively with others, not to control and maintain it independently. Apache offers the best legal and social framework for such community-based software development.
== Documentation ==

http://research.yahoo.com/project/pig

== Initial Source ==
The initial source will be donated by Yahoo! Inc. The donating company will contribute the initial code base once the proposal is accepted and the necessary infrastructure has been set up.
== External Dependencies ==

 1. bzip2: http://www.kohsuke.org/bzip2/ (Apache license)
 2. javacc: https://javacc.dev.java.net/ (BSD license)
 3. hadoop: http://lucene.apache.org/hadoop/ (Apache license)
 4. log4j: http://logging.apache.org/log4j/ (Apache license)
 5. jsch: http://www.jcraft.com/jsch (BSD-style license: http://www.jcraft.com/jsch/LICENSE.txt)
== Required Resources ==
=== Mailing Lists ===

We would need the following mailing lists:
 1. pig-private (with moderated subscriptions)
 2. pig-dev
 3. pig-commits
 4. pig-user

=== Subversion Directory ===

https://svn.apache.org/repos/asf/incubator/pig

=== Issue Tracking ===

JIRA PIG (PIG)

== Initial Committers ==

 1. Nigel Daley ([EMAIL PROTECTED])
 2. Alan Gates ([EMAIL PROTECTED])
 3. Olga Natkovich ([EMAIL PROTECTED])
 4. Chris Olston ([EMAIL PROTECTED])
 5. Owen O'Malley ([EMAIL PROTECTED])
 6. Ben Reed ([EMAIL PROTECTED])
 7. Utkarsh Srivastava ([EMAIL PROTECTED])

== Affiliation ==

All initial committers are affiliated with Yahoo!

== Sponsors ==

=== Champion ===

Doug Cutting

=== Nominated Mentors ===

   1. Doug Cutting
   2. Torsten Curdt
   3. Bertrand Delacretaz
   4. Yoav Shapira
   5. Sylvain Wallez

=== Sponsoring Entity ===

Incubator


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

