Re: [VOTE] Accept Parquet into the incubator

Brock Noland Mon, 19 May 2014 19:04:22 -0700

[X ] +1 Accept Parquet into the Incubator

non-binding



On Mon, May 19, 2014 at 11:24 AM, Andrew Purtell <apurt...@apache.org> wrote:

> +1 (binding)
>
>
> On Sun, May 18, 2014 at 2:15 PM, Chris Aniszczyk <caniszc...@gmail.com> wrote:
>
> > Based on the results of the discussion thread:
> >
> >
> > http://mail-archives.apache.org/mod_mbox/incubator-general/201405.mbox/%3CCAJg1wMRGhLu4P7LeVQB%2B5K0C-fr-pw2448uj%3D6-3zHag4F1EbA%40mail.gmail.com%3E
> >
> > I would like to call a vote on accepting Parquet into the incubator.
> > https://wiki.apache.org/incubator/ParquetProposal
> >
> > [ ] +1 Accept Parquet into the Incubator
> > [ ] +0 Indifferent to the acceptance of Parquet
> > [ ] -1 Do not accept Parquet because ...
> >
> > The vote will be open until Thursday May 22nd 18:00 UTC.
> >
> > = Parquet Proposal =
> >
> > == Abstract ==
> > Parquet is a columnar storage format for Hadoop.
> >
> > == Proposal ==
> >
> > We created Parquet to make the advantages of compressed, efficient
> > columnar data representation available to any project in the Hadoop
> > ecosystem, regardless of the choice of data processing framework, data
> > model, or programming language.
> >
> > == Background ==
> >
> > Parquet is built from the ground up with complex nested data structures
> > in mind, and uses the repetition/definition level approach to encoding
> > such data structures, as popularized by Google Dremel (
> > https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We
> > believe this approach is superior to simple flattening of nested name
> > spaces.
> >
> > Parquet is built to support very efficient compression and encoding
> > schemes. Parquet allows compression schemes to be specified on a
> > per-column level, and is future-proofed to allow adding more encodings as
> > they are invented and implemented. We separate the concepts of encoding
> > and compression, allowing Parquet consumers to implement operators that
> > work directly on encoded data without paying a decompression and decoding
> > penalty when possible.
> >
> > == Rationale ==
> >
> > Parquet is built to be used by anyone. We believe that an efficient,
> > well-implemented columnar storage substrate should be useful to all
> > frameworks without the cost of extensive and difficult to set up
> > dependencies.
> >
> > Furthermore, the rapid growth of the Parquet community is empowered by open
> > source. We believe the Apache foundation is a great fit as the long-term
> > home for Parquet, as it provides an established process for
> > community-driven development and decision making by consensus. This is
> > exactly the model we want for future Parquet development.
> >
> > == Initial Goals ==
> >
> >  * Move the existing codebase to Apache
> >  * Integrate with the Apache development process
> >  * Ensure all dependencies are compliant with Apache License version 2.0
> >  * Incremental development and releases per Apache guidelines
> >
> > == Current Status ==
> >
> > Parquet has undergone 2 major releases of the core format (
> > https://github.com/Parquet/parquet-format/releases) and 22 releases of
> > the supporting set of Java libraries (
> > https://github.com/Parquet/parquet-mr/releases).
> >
> > The Parquet source is currently hosted at GitHub, which will seed the
> > Apache git repository.
> >
> > === Meritocracy ===
> >
> > We plan to invest in supporting a meritocracy. We will discuss the
> > requirements in an open forum. Several companies have already expressed
> > interest in this project, and we intend to invite additional developers
> > to participate. We will encourage and monitor community participation so
> > that privileges can be extended to those who contribute.
> >
> > === Community ===
> >
> > There is a large need for an advanced columnar storage format for Hadoop.
> > Parquet is being used in production by many organizations (see
> > https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md)
> >
> >  * Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
> >  * Criteo: https://twitter.com/julsimon/statuses/312114074911666177
> >  * Salesforce: https://twitter.com/TwitterOSS/statuses/392734610116726784
> >  * Stripe: https://twitter.com/avibryant/statuses/391339949250715648
> >  * Twitter: https://twitter.com/J_/statuses/315844725611581441
> >
> > By bringing Parquet into Apache, we believe that the community will grow
> > even bigger.
> >
> > === Core Developers ===
> >
> > Parquet was initially developed as a collaboration between Twitter,
> > Cloudera and Criteo.
> >
> > See
> > https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
> >
> > === Alignment ===
> >
> > We believe that having Parquet at Apache will help further the growth of
> > the big-data community, as it will encourage cooperation within the
> > greater ecosystem of projects spawned by Apache Hadoop. The alignment is
> > also beneficial to other Apache communities (such as Hadoop, Hive, Avro).
> >
> > == Known Risks ==
> >
> > === Orphaned Products ===
> >
> > The risk of the Parquet project being abandoned is minimal. There are
> > many organizations using Parquet in production, including Twitter,
> > Cloudera, Stripe, and Salesforce (
> > http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
> >
> > === Inexperience with Open Source ===
> >
> > Parquet has existed as a healthy open source project for one year. During
> > that time, we have curated an open-source community successfully,
> > attracting over 40 contributors (see
> > https://github.com/Parquet/parquet-mr/graphs/contributors) from a diverse
> > group of companies.
> > Several of the core contributors to the project are deeply familiar with
> > OSS and Apache specifically: Julien Le Dem was until recently the PMC
> > Chair for Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan
> > Coveney are also Apache Pig committers with contributions to several
> > other Apache projects. Todd Lipcon and Tom White are committers to Apache
> > Hadoop and multiple other related projects. Brock Noland is a Hive
> > committer.
> >
> > === Homogenous Developers ===
> >
> > The initial committers come from a number of companies and countries.
> > Parquet has an active community of developers, and we are committed to
> > recruiting additional committers based on their contributions to the
> > project. The Java library component alone has contributions from 31
> > individual GitHub accounts, 14 of which contributed over 1000 lines of
> > code.
> >
> > === Reliance on Salaried Developers ===
> >
> > It is expected that Parquet development will occur on both salaried time
> > and on volunteer time, after hours. The majority of initial committers
> > are paid by their employers to contribute to this project. However, they
> > are all passionate about the project, and we are confident that the
> > project will continue even if no salaried developers contribute to the
> > project. As evidence of this statement, we present the GitHub punchcard
> > (see https://github.com/Parquet/parquet-mr/graphs/punch-card) showing
> > that a lot of activity happens on weekends. We are committed to
> > recruiting additional committers including non-salaried developers.
> >
> > === Relationships with Other Apache Products ===
> >
> > As mentioned in the Alignment section, Parquet is closely related to
> > Hadoop. It provides an API that has allowed it to be easily integrated
> > with many other Apache projects: Pig, Hive, Avro, Thrift, Spark, Drill,
> > Crunch, and Tajo. Some of the features it provides are similar to the ORC
> > file format, which is part of the Hive project. However, Parquet has
> > focused on being framework agnostic and language independent, and has
> > been really successful to that end. On top of the Apache projects
> > mentioned above, Parquet is also integrated with other open source
> > projects, including Protocol Buffers, Cloudera Impala, and Scrooge. We
> > look forward to continuing to collaborate with those communities, as
> > well as other Apache communities.
> >
> > === An Excessive Fascination with the Apache Brand ===
> >
> > Parquet is already a healthy and well-known open source project. This
> > proposal is not for the purpose of generating publicity. Rather, the
> > primary benefits to joining Apache are those outlined in the Rationale
> > section.
> >
> > == Documentation ==
> >
> > Documentation is currently located as README markdown files:
> >
> >  * https://github.com/Parquet/parquet-format
> >  * https://github.com/Parquet/parquet-mr
> >
> > == Source and Intellectual Property Submission Plan ==
> >
> > The Parquet codebase is currently hosted on GitHub:
> > https://github.com/Parquet.
> >
> > These are the codebases that we would migrate to the Apache foundation.
> >
> > == External Dependencies ==
> >
> >
> >  * JUnit: EPL
> >  * Apache Commons: ALv2
> >  * Apache Thrift: ALv2
> >  * Apache Maven: ALv2
> >  * Apache Avro: ALv2
> >  * Apache Hadoop: ALv2
> >  * Google Guava: ALv2
> >  * Google Protobuf: New BSD License
> >
> > == Cryptography ==
> >
> > We do not expect Parquet to be a controlled export item due to the use of
> > encryption.
> >
> > == Required Resources ==
> >
> > === Mailing lists ===
> >
> >  * priv...@parquet.incubator.apache.org
> >  * comm...@parquet.incubator.apache.org
> >  * d...@parquet.incubator.apache.org
> >
> > == Subversion Directory ==
> >
> > Git is the preferred source control system:
> >
> >  * git://git.apache.org/parquet-format
> >  * git://git.apache.org/parquet-mr
> >
> > == Issue Tracking ==
> >
> > We'd like to keep using the Git review and issue tracking tools, and to
> > control the closing of pull requests through git commit messages in
> > git.apache.org.
> >
> > == Initial Committers ==
> >
> >  * Aniket Mokashi <aniket...@gmail.com>
> >  * Brock Noland <br...@apache.org>
> >  * Chris Aniszczyk <caniszc...@gmail.com>
> >  * Dmitriy Ryaboy <dvrya...@apache.org>
> >  * Jake Farrell <jfarr...@apache.org>
> >  * Jonathan Coveney <jcove...@gmail.com>
> >  * Julien Le Dem <jul...@apache.org>
> >  * Lukas Nalezenec <lukas.naleze...@gmail.com>
> >  * Marcel Kornacker <mar...@cloudera.com>
> >  * Mickael Lacour
> >  * Nong Li <n...@cloudera.com>
> >  * Remy Pecqueur
> >  * Ryan Blue <b...@cloudera.com>
> >  * Tianshuo Deng <dengtians...@gmail.com>
> >  * Tom White <tomwh...@apache.org>
> >  * Wesley Peck
> >
> > == Affiliations ==
> >
> >  * Aniket Mokashi - Twitter
> >  * Brock Noland - Cloudera
> >  * Chris Aniszczyk - Twitter
> >  * Dmitriy Ryaboy - Twitter
> >  * Jake Farrell
> >  * Jonathan Coveney - Twitter
> >  * Julien Le Dem - Twitter
> >  * Lukas Nalezenec
> >  * Marcel Kornacker - Cloudera
> >  * Mickael Lacour - Criteo
> >  * Nong Li - Cloudera
> >  * Remy Pecqueur - Criteo
> >  * Ryan Blue - Cloudera
> >  * Tianshuo Deng - Twitter
> >  * Tom White - Cloudera
> >  * Wesley Peck - ARRIS, Inc.
> >
> > == Sponsors ==
> >
> > === Champion ===
> >
> >  * Todd Lipcon
> >
> > === Nominated Mentors ===
> >
> >  * Tom White
> >  * Chris Mattmann
> >  * Jake Farrell
> >  * Roman Shaposhnik
> >
> > === Sponsoring Entity ===
> >
> > The Apache Incubator
> >
> > --
> > Cheers,
> >
> > Chris Aniszczyk
> > http://aniszczyk.org
> > +1 512 961 6719
> >
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
