+1
Erik
On Sep 25, 2007, at 1:20 PM, Doug Cutting wrote:
I would like to call the Incubator PMC to vote to incubate the
proposed Pig project. Discussion on this list evidenced broad
interest in this project, which bodes well for its ability to build
a diverse developer community.
http://wiki.apache.org/incubator/PigProposal
+1
Doug
-----------------------------------------------------------
= Proposal for Pig Project =
== Abstract ==
Pig is a platform for analyzing large data sets.
== Proposal ==
The Pig project consists of high-level languages for expressing
data analysis programs, coupled with infrastructure for evaluating
these programs. The salient property of Pig programs is that their
structure is amenable to substantial parallelization, which in
turns enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a
compiler that produces sequences of Map-Reduce programs, for which
large-scale parallel implementations already exist (e.g., the
Hadoop subproject). Pig's language layer currently consists of a
textual language called Pig Latin, which has the following key
properties:
1. ''Ease of programming''. It is trivial to achieve parallel
execution of simple, "embarrassingly parallel" data analysis tasks.
Complex tasks comprised of multiple interrelated data
transformations are explicitly encoded as data flow sequences,
making them easy to write, understand, and maintain.
2. ''Optimization opportunities''. The way in which tasks are
encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics rather than
efficiency.
3. ''Extensibility''. Users can create their own functions to do
special-purpose processing.
== Background ==
Pig started as a research project at Yahoo! in May of 2006 to
combine ideas in parallel databases and distributed computing. The
first internal release took place in July 2006. The first release
was a simple front-end to the Hadoop Map/Reduce framework. The
following releases added new features and evolved the language
based on user feedback. In July 2007, pig was taken over by a
development team and the first production version is due to be
released on 9/28/07.
Since its inception, we had observed a steady growth of the user
community within Yahoo!. In April 2007, Pig was released under a
BSD-type license. Several external parties are using this version
and have expressed interest in collaborating on its development.
== Rationale ==
In an information-centric world, innovation is driven by ad-hoc
analysis of large data sets. For example, search engine companies
routinely deploy and refine services based on analyzing the
recorded behavior of users, publishers, and advertisers. The rate
of innovation depends on the efficiency with which data can be
analyzed.
To analyze large data sets efficiently, one needs parallelism. The
cheapest and most scalable form of parallelism is cluster
computing. Unfortunately, programming for a cluster computing
environment is difficult and time-consuming. Pig makes it easy to
harness the power of cluster computing for ad-hoc data analysis.
While other language exist that try to achieve the same goals, we
believe that Pig provides more flexibility and gives more control
to the end user.
SQL typically requires (1) importing data from a user's preferred
format into a database system's internal format (2) well-
structured, normalized data with a declared schema, and (3)
programs expressed in declarative SELECT-FROM-WHERE blocks. In
contrast, Pig Latin facilitates (1) interoperability, i.e. data may
be read/written in a format accepted by other applications such as
text editors or graph generators (2) flexibility, i.e. data may be
loosely structured or have structure that is
defined operationally, and (3) adoption by programmers who find
procedural programming more natural than declarative programming.
Sawzall is a scripting language used at Google on top of Map-
Reduce. A sawzall program has a fairly rigid structure consisting
of a filtering phase (the map step) followed by an aggregation
phase (the reduce step). Furthermore, only the filtering phase can
be written by the user, and only a pre-built set of aggregations
are available (new ones are non-trivial to add). While Pig Latin
has similar higher level primitives like filtering and aggregation,
an arbitrary number of them can be flexibly chained together in a
Pig Latin program, and all primitives can use user-defined
functions with equal ease. Further, Pig Latin has additional
primitives such as cogrouping, that allow operations such as joins
(which require multiple programs in Sawzall) to be written in a
single line in Pig Latin. Further, Pig Latin is designed
to be embedded into other languages, and can use functions written
in other languages. Thus, in contrast to Sawzall, it directly
caters to a large community of developers without having to make
them learn an entirely new programming language.
== Current Status ==
=== Meritocracy ===
Pig was started as a project that was developed by Yahoo! research
team. Recently we have added a development team that works in
harmony with the research team with both teams actively and
successfully contributing to the project. We are planning to create
the environment that encourages meritocracy and is consistent with
the meritocracy principles of Apache. Within the team we have
people actively participating in the Hadoop subproject.
=== Community ===
Pig has an active user community within Yahoo! that has been
steadily growing. Pig also attracted external users since its
release under a BSD-type license. Several external parties are
using the product and have expressed interest in collaborating on
its development.
Also, since the current version of Pig is built on top of the
Hadoop we believe that we will be able to quickly extend our
community by attracting both the Hadoop users and developers to the
project.
=== Core Developers ===
Our contributors come from both research and development world and
most have background in database internals and large scale
distributed systems.
=== Alignment ===
Yahoo! seeks to develop Pig collaboratively with others, not to
control and maintain it independently. Apache offers the best
legal and social framework for such community-based software
development.
Also, the current version of Pig runs on top of the Hadoop's Map-
Reduce infrastructure which is part of Apache. We believe there
would be a lot of synergy between the projects both in terms of
users and developers.
== Known Risks ==
=== Orphaned products ===
All current contributors are part of Yahoo which is a major player
in the space and is committed to grid computing. Also we expect
high degree of synergy with Hadoop subproject.
=== Inexperience with Open Source ===
Two of the committers have extensive experience with open source
and Apache. The rest are new to open source and will be guided
through the process by the team members with experience.
=== Homogenous Developers ===
The current list of committers is confined to Yahoo employees. Our
plan is to recruit more committers once the project gets on the way.
=== Reliance on Salaried Developers ===
Currently, all contributors are Yahoo employees. By extending the
development community we are hoping to mitigate this risk.
=== Relationships with Other Apache Products ===
Pig is built on top of Hadoop and we expect deep collaboration with
Hadoop subproject.
=== An Excessive Fascination with the Apache Brand ===
Yahoo already have a strong brand and is not interested in Apache
as a way to gain visibility. Yahoo! seeks to develop Pig
collaboratively with others, not to control and maintain it
independently. Apache offers the best legal and social framework
for such community-based software development.
== Documentation ==
http://research.yahoo.com/project/pig
== Initial Source ==
The initial source will be donated by Yahoo Inc. The donating
company will contribute the initial code base once the proposal is
accepted and necessary infrastructure has been set up.
== External Dependencies ==
1. bzip2: http://www.kohsuke.org/bzip2/:Apache license
2. javacc: https://javacc.dev.java.net/:BSD license
3. hadoop: http://lucene.apache.org/hadoop/:Apache license
4. log4j: http://logging.apache.org/log4j/: Apache license
5. jsch: http://www.jcraft.com/jsch: BSD style license: http://
www.jcraft.com/jsch/LICENSE.txt
== Required Resources ==
== Mailing lists ==
We would need the following mailing lists
1. pig-private (with moderated subscriptions)
2. pig-dev
3. pig-commits
4. pig-user
=== Subversion Directory ===
https://svn.apache.org/repos/asf/incubator/pig
=== Issue Tracking ===
JIRA PIG (PIG)
== Initial Committers ==
1. Nigel Daley ([EMAIL PROTECTED])
2. Alan Gates ([EMAIL PROTECTED])
3. Olga Natkovich ([EMAIL PROTECTED])
4. Chris Olston ([EMAIL PROTECTED])
5. Owen O'Malley ([EMAIL PROTECTED])
6. Ben Reed ([EMAIL PROTECTED])
7. Utkarsh Srivastava ([EMAIL PROTECTED])
== Affiliation ==
All initial committers are affiliated with Yahoo!
== Sponsors ==
=== Champion ===
Doug Cutting
=== Nominated Mentors ===
1. Doug Cutting
2. Torsten Curdt
3. Bertrand Delacretaz
4. Yoav Shapira
5. Sylvain Wallez
=== Sponsoring Entity ===
Incubator
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]