> > +1
>
> From: Kenneth Knowles <k...@apache.org>
> Date: Thu, Nov 15, 2018 at 10:01 AM
> Subject: Re: [VOTE] Accept the Iceberg project for incubation
> To: <general@incubator.apache.org>
>
>
> +1 (non-binding)
>
> On Thu, Nov 15, 2018 at 9:57 AM Michael Wall <mjw...@apache.org> wrote:
>
> > +1 (binding)
> >
> > On Thu, Nov 15, 2018 at 3:03 AM Olivier Lamy <ol...@apache.org> wrote:
> >
> > > +1
> > >
> > > On Wed, 14 Nov 2018 at 03:07, Ryan Blue <b...@apache.org> wrote:
> > >
> > > > The discuss thread seems to have reached consensus, so I propose
> > > > accepting the Iceberg project for incubation.
> > > >
> > > > The proposal is copied below and in the wiki:
> > > > https://wiki.apache.org/incubator/IcebergProposal
> > > >
> > > > Please vote on whether to accept Iceberg in the next 72 hours:
> > > >
> > > > [ ] +1, accept Iceberg for incubation
> > > > [ ] -1, reject the Iceberg proposal because . . .
> > > >
> > > > Thank you for reviewing the proposal and voting,
> > > >
> > > > rb
> > > > ------------------------------
> > > > Iceberg Proposal
> > > > Abstract
> > > >
> > > > Iceberg is a table format for large, slow-moving tabular data.
> > > >
> > > > It is designed to improve on the de-facto standard table layout
> > > > built into Apache Hive, Presto, and Apache Spark.
> > > > Proposal
> > > >
> > > > The purpose of Iceberg is to provide SQL-like tables that are
> > > > backed by large sets of data files. Iceberg is similar to the Hive
> > > > table layout, the de-facto standard structure used to track files
> > > > in a table, but provides additional guarantees and performance
> > > > optimizations:
> > > >
> > > > - Atomicity - Each change to the table will be complete or will
> > > >   fail. “Do or do not. There is no try.”
> > > > - Snapshot isolation - Reads use one and only one snapshot of a
> > > >   table at some time without holding a lock.
> > > > - Safe schema evolution - A table’s schema can change in
> > > >   well-defined ways, without breaking older data files.
> > > > - Column projection - An engine may request a subset of the
> > > >   available columns, including nested fields.
> > > > - Predicate pushdown - An engine can push filters into read
> > > >   planning to improve performance using partition data and
> > > >   file-level statistics.
> > > >
> > > > Iceberg does NOT define a new file format. All data is stored in
> > > > Apache Avro, Apache ORC, or Apache Parquet files.
> > > >
> > > > Additionally, Iceberg is designed to work well when data files are
> > > > stored in cloud blob stores, even when those systems provide weaker
> > > > guarantees than a file system, including:
> > > >
> > > > - Eventual consistency in the namespace
> > > > - High latency for directory listings
> > > > - No renames of objects
> > > > - No folder hierarchy
> > > >
> > > > Rationale
> > > >
> > > > Initial benchmarks show dramatic improvements in query planning.
> > > > For example, in Netflix’s Atlas use case, which stores time-series
> > > > metrics from Netflix runtime systems and 1 month is stored across
> > > > 2.7 million files in 2,688 partitions:
> > > >
> > > > - Hive table using Parquet:
> > > >   - 400k+ splits, not combined
> > > >   - Explain query: 9.6 minutes wall time (planning only)
> > > > - Iceberg table with partition filtering:
> > > >   - 15,218 splits, combined
> > > >   - Planning: 10 seconds
> > > >   - Query wall time: 13 minutes
> > > > - Iceberg table with partition and min/max filtering:
> > > >   - 412 splits
> > > >   - Planning: 25 seconds
> > > >   - Query wall time: 42 seconds
> > > >
> > > > These performance gains combined with the cross-engine
> > > > compatibility are a very compelling story.
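
[Inline note, not part of the quoted proposal] For readers unfamiliar with the "partition and min/max filtering" step in the Rationale, here is a minimal sketch of the general idea: a planner keeps per-file statistics and skips files whose value range cannot match the filter. Everything in it (the DataFile record, the plan_files function, the file names and values) is hypothetical and chosen for illustration; it is not Iceberg's API or its metadata layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    """Hypothetical per-file metadata a planner might track."""
    path: str
    partition: str   # partition value, e.g. a date string
    min_value: int   # smallest value of the filtered column in this file
    max_value: int   # largest value of the filtered column in this file


def plan_files(files: List[DataFile], partition: str,
               lower: int, upper: int) -> List[DataFile]:
    """Keep only files that might hold rows with lower <= value <= upper."""
    selected = []
    for f in files:
        # Partition filtering: skip files outside the requested partition.
        if f.partition != partition:
            continue
        # Min/max filtering: skip files whose range cannot overlap the filter.
        if f.max_value < lower or f.min_value > upper:
            continue
        selected.append(f)
    return selected


# Illustrative metadata only; these files and numbers are made up.
FILES = [
    DataFile("a.parquet", "2018-11-01", 0, 99),
    DataFile("b.parquet", "2018-11-01", 100, 199),
    DataFile("c.parquet", "2018-11-02", 0, 199),
]

if __name__ == "__main__":
    for f in plan_files(FILES, "2018-11-01", 150, 160):
        print(f.path)
```

Planning here touches only the metadata records, never the data files themselves, which is why this kind of pruning can cut the number of splits before any file is opened.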
> > > > Initial Goals
> > > >
> > > > The initial goal will be to move the existing codebase to Apache
> > > > and integrate with the Apache development process and
> > > > infrastructure. A primary goal of incubation will be to grow and
> > > > diversify the Iceberg community. We are well aware that the project
> > > > community is largely comprised of individuals from a single
> > > > company. We aim to change that during incubation.
> > > > Current Status
> > > >
> > > > As previously mentioned, Iceberg is under active development at
> > > > Netflix, and is being used in processing large volumes of data in
> > > > Amazon EC2.
> > > >
> > > > Iceberg license documentation is already based on Apache guidelines
> > > > for LICENSE and NOTICE content.
> > > > Meritocracy
> > > >
> > > > We value meritocracy and we understand that it is the basis for an
> > > > open community that encourages multiple companies and individuals
> > > > to contribute and be invested in the project’s future. We will
> > > > encourage and monitor participation and make sure to extend
> > > > privileges and responsibilities to all contributors.
> > > > Community
> > > >
> > > > Iceberg is currently being used by developers at Netflix and a
> > > > growing number of users are actively using it in production
> > > > environments. Iceberg has received contributions from developers
> > > > working at Hortonworks, WeWork, and Palantir. By bringing Iceberg
> > > > to Apache we aim to assure current and future contributors that the
> > > > Iceberg community is meritocratic and open, in order to broaden and
> > > > diversify the user and developer community.
> > > > Core Developers
> > > >
> > > > Iceberg was initially developed at Netflix and is under active
> > > > development.
> > > > We believe Iceberg will be of interest to a broad range of users
> > > > and developers and that incubating the project at the ASF will help
> > > > us build a diverse, sustainable community.
> > > > Alignment
> > > >
> > > > Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive,
> > > > ORC, Parquet, Pig, and Spark. We anticipate integration with
> > > > additional Apache projects as the Iceberg community and interest in
> > > > the project grows.
> > > > Known Risks
> > > > Orphaned Products
> > > >
> > > > Netflix is committed to the future development of Iceberg and
> > > > understands that graduation to a TLP, while preferable, is not the
> > > > only positive outcome of incubation.
> > > >
> > > > Should the Iceberg project be accepted by the Incubator, the
> > > > prospective PPMC would be willing to agree to a target incubation
> > > > period of 2 years or less, knowing that every Incubator project
> > > > incurs a certain cost in terms of ASF infrastructure and volunteer
> > > > time.
> > > > Inexperience with Open Source
> > > >
> > > > Three of the initial committers are Apache members and Incubator
> > > > PMC members. They will work with the other community members to
> > > > teach them the Apache Way.
> > > > Homogenous Developers
> > > >
> > > > The majority of the committers work at Netflix, though we are
> > > > committed to recruiting and developing additional committers from a
> > > > wide spectrum of industries and backgrounds.
> > > > Reliance on Salaried Developers
> > > >
> > > > It is expected that Iceberg development will occur on both salaried
> > > > time and on volunteer time, after hours. Most of the initial
> > > > committers are paid by Netflix to contribute to this project.
> > > > However, they are all passionate about the project, and we are both
> > > > confident and hopeful that the project will continue even if no
> > > > salaried developers contribute to the project.
> > > > Relationships with Other Apache Products
> > > >
> > > > As mentioned in the Rationale section, Iceberg utilizes a number of
> > > > existing Apache projects (Avro, Hadoop, Hive, ORC, Parquet, Pig, &
> > > > Spark), and we expect that list to expand as the community grows
> > > > and diversifies. Any Apache project in the big data space that
> > > > needs to store or process tabular data would be potentially
> > > > relevant.
> > > > An Excessive Fascination with the Apache Brand
> > > >
> > > > We are applying to the Incubator process because we think it is the
> > > > next logical step for the Iceberg project after open-sourcing the
> > > > code. This proposal is not for the purpose of generating publicity.
> > > > Rather, we want to make sure to create a very inclusive and
> > > > meritocratic community, outside the umbrella of a single company.
> > > > Netflix has a long history of contributing to Apache projects and
> > > > the Iceberg developers and contributors understand the implication
> > > > of making it an Apache project.
> > > > Required Resources
> > > > Mailing lists
> > > >
> > > > - d...@iceberg.incubator.apache.org
> > > > - comm...@iceberg.incubator.apache.org
> > > > - priv...@iceberg.incubator.apache.org
> > > >
> > > > The podling may also create a user mailing list, if needed.
> > > > Source Control and Issue Tracking
> > > >
> > > > The Iceberg podling would use Apache’s gitbox integration to sync
> > > > between github and Apache infrastructure. The podling would use
> > > > github issues and pull requests for community engagement.
> > > > Current Resources
> > > >
> > > > - Initial source: https://github.com/Netflix/iceberg
> > > > - Java documentation:
> > > >   https://netflix.github.io/iceberg/current/javadoc/index.html?com/netflix/iceberg/package-summary.html
> > > > - Table specification:
> > > >   https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit
> > > >
> > > > Source and Intellectual Property Submission Plan
> > > >
> > > > The Iceberg source code in Github is currently licensed under
> > > > Apache License v2.0 and the copyright is assigned to Netflix. If
> > > > Iceberg becomes an Incubator project at the ASF, Netflix will
> > > > transfer the source code and trademark ownership to the Apache
> > > > Software Foundation via a Software Grant Agreement.
> > > > External Dependencies
> > > >
> > > > External dependencies licensed under Apache License 2.0
> > > >
> > > > - Guava https://github.com/google/guava
> > > > - Jackson https://github.com/FasterXML/jackson-core
> > > > - Joda-Time http://www.joda.org/joda-time/
> > > >
> > > > External dependencies licensed under the MIT License
> > > >
> > > > - SLF4J https://www.slf4j.org/
> > > > - Mockito https://github.com/mockito/mockito
> > > >
> > > > ASF Projects
> > > >
> > > > - Apache Avro
> > > > - Apache Hadoop
> > > > - Apache Hive
> > > > - Apache ORC
> > > > - Apache Parquet
> > > > - Apache Pig
> > > > - Apache Spark
> > > >
> > > > Cryptography
> > > >
> > > > We do not expect Iceberg to be a controlled export item due to the
> > > > use of encryption.
> > > > Initial Committers and Affiliations
> > > >
> > > > - Ryan Blue b...@apache.org (Netflix)
> > > > - Parth Brahmbhatt pa...@apache.org (Netflix)
> > > > - Julien Le Dem jul...@apache.org (WeWork)
> > > > - Owen O’Malley omal...@apache.org (Hortonworks)
> > > > - Daniel Weeks dwe...@apache.org (Netflix)
> > > >
> > > > Sponsors and Nominated Mentors
> > > >
> > > > - Champion and mentor: Owen O’Malley omal...@apache.org
> > > > - Mentor: Ryan Blue b...@apache.org
> > > > - Mentor: Julien Le Dem jul...@apache.org
> > > >
> > > > Sponsoring Entity
> > > >
> > > > The Apache Incubator
> > > > --
> > > > Ryan Blue
> > > >
> > >
> > >
> > > --
> > > Olivier Lamy
> > > http://twitter.com/olamy | http://linkedin.com/in/olamy
> > >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>