+1 (binding)

On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher <dave2w...@comcast.net> wrote:
> +1 (binding)
>
> > On Nov 13, 2018, at 9:10 AM, Matt Sicker <boa...@gmail.com> wrote:
> >
> > +1 binding
> >
> > On Tue, 13 Nov 2018 at 11:09, Ryan Blue <b...@apache.org> wrote:
> >
> >> +1 (binding)
> >>
> >> On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue <b...@apache.org> wrote:
> >>
> >>> The discuss thread seems to have reached consensus, so I propose
> >>> accepting the Iceberg project for incubation.
> >>>
> >>> The proposal is copied below and in the wiki:
> >>> https://wiki.apache.org/incubator/IcebergProposal
> >>>
> >>> Please vote on whether to accept Iceberg in the next 72 hours:
> >>>
> >>> [ ] +1, accept Iceberg for incubation
> >>> [ ] -1, reject the Iceberg proposal because . . .
> >>>
> >>> Thank you for reviewing the proposal and voting,
> >>>
> >>> rb
> >>> ------------------------------
> >>> Iceberg Proposal
> >>> Abstract
> >>>
> >>> Iceberg is a table format for large, slow-moving tabular data.
> >>>
> >>> It is designed to improve on the de-facto standard table layout
> >>> built into Apache Hive, Presto, and Apache Spark.
> >>> Proposal
> >>>
> >>> The purpose of Iceberg is to provide SQL-like tables that are backed
> >>> by large sets of data files. Iceberg is similar to the Hive table
> >>> layout, the de-facto standard structure used to track files in a
> >>> table, but provides additional guarantees and performance
> >>> optimizations:
> >>>
> >>>    - Atomicity - Each change to the table will be complete or will
> >>>    fail. “Do or do not. There is no try.”
> >>>    - Snapshot isolation - Reads use one and only one snapshot of a
> >>>    table at some time without holding a lock.
> >>>    - Safe schema evolution - A table’s schema can change in
> >>>    well-defined ways, without breaking older data files.
> >>>    - Column projection - An engine may request a subset of the
> >>>    available columns, including nested fields.
> >>>    - Predicate pushdown - An engine can push filters into read
> >>>    planning to improve performance using partition data and
> >>>    file-level statistics.
> >>>
> >>> Iceberg does NOT define a new file format. All data is stored in
> >>> Apache Avro, Apache ORC, or Apache Parquet files.
> >>>
> >>> Additionally, Iceberg is designed to work well when data files are
> >>> stored in cloud blob stores, even when those systems provide weaker
> >>> guarantees than a file system, including:
> >>>
> >>>    - Eventual consistency in the namespace
> >>>    - High latency for directory listings
> >>>    - No renames of objects
> >>>    - No folder hierarchy
> >>>
> >>> Rationale
> >>>
> >>> Initial benchmarks show dramatic improvements in query planning. For
> >>> example, in Netflix’s Atlas use case, which stores time-series
> >>> metrics from Netflix runtime systems, 1 month of data is stored
> >>> across 2.7 million files in 2,688 partitions:
> >>>
> >>>    - Hive table using Parquet:
> >>>       - 400k+ splits, not combined
> >>>       - Explain query: 9.6 minutes wall time (planning only)
> >>>    - Iceberg table with partition filtering:
> >>>       - 15,218 splits, combined
> >>>       - Planning: 10 seconds
> >>>       - Query wall time: 13 minutes
> >>>    - Iceberg table with partition and min/max filtering:
> >>>       - 412 splits
> >>>       - Planning: 25 seconds
> >>>       - Query wall time: 42 seconds
> >>>
> >>> These performance gains combined with the cross-engine compatibility
> >>> are a very compelling story.
> >>> Initial Goals
> >>>
> >>> The initial goal will be to move the existing codebase to Apache and
> >>> integrate with the Apache development process and infrastructure. A
> >>> primary goal of incubation will be to grow and diversify the Iceberg
> >>> community. We are well aware that the project community is largely
> >>> comprised of individuals from a single company. We aim to change
> >>> that during incubation.
> >>> Current Status
> >>>
> >>> As previously mentioned, Iceberg is under active development at
> >>> Netflix, and is being used in processing large volumes of data in
> >>> Amazon EC2.
> >>>
> >>> Iceberg license documentation is already based on Apache guidelines
> >>> for LICENSE and NOTICE content.
> >>> Meritocracy
> >>>
> >>> We value meritocracy and we understand that it is the basis for an
> >>> open community that encourages multiple companies and individuals to
> >>> contribute and be invested in the project’s future. We will
> >>> encourage and monitor participation and make sure to extend
> >>> privileges and responsibilities to all contributors.
> >>> Community
> >>>
> >>> Iceberg is currently being used by developers at Netflix and a
> >>> growing number of users are actively using it in production
> >>> environments. Iceberg has received contributions from developers
> >>> working at Hortonworks, WeWork, and Palantir. By bringing Iceberg to
> >>> Apache we aim to assure current and future contributors that the
> >>> Iceberg community is meritocratic and open, in order to broaden and
> >>> diversify the user and developer community.
> >>> Core Developers
> >>>
> >>> Iceberg was initially developed at Netflix and is under active
> >>> development. We believe Iceberg will be of interest to a broad range
> >>> of users and developers and that incubating the project at the ASF
> >>> will help us build a diverse, sustainable community.
> >>> Alignment
> >>>
> >>> Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive,
> >>> ORC, Parquet, Pig, and Spark. We anticipate integration with
> >>> additional Apache projects as the Iceberg community and interest in
> >>> the project grow.
> >>> Known Risks
> >>> Orphaned Products
> >>>
> >>> Netflix is committed to the future development of Iceberg and
> >>> understands that graduation to a TLP, while preferable, is not the
> >>> only positive outcome of incubation.
> >>>
> >>> Should the Iceberg project be accepted by the Incubator, the
> >>> prospective PPMC would be willing to agree to a target incubation
> >>> period of 2 years or less, knowing that every Incubator project
> >>> incurs a certain cost in terms of ASF infrastructure and volunteer
> >>> time.
> >>> Inexperience with Open Source
> >>>
> >>> Three of the initial committers are Apache members and Incubator PMC
> >>> members. They will work with the other community members to teach
> >>> them the Apache Way.
> >>> Homogeneous Developers
> >>>
> >>> The majority of the committers work at Netflix, though we are
> >>> committed to recruiting and developing additional committers from a
> >>> wide spectrum of industries and backgrounds.
> >>> Reliance on Salaried Developers
> >>>
> >>> It is expected that Iceberg development will occur on both salaried
> >>> time and on volunteer time, after hours. Most of the initial
> >>> committers are paid by Netflix to contribute to this project.
> >>> However, they are all passionate about the project, and we are both
> >>> confident and hopeful that the project will continue even if no
> >>> salaried developers contribute to the project.
> >>> Relationships with Other Apache Products
> >>>
> >>> As mentioned in the Alignment section, Iceberg utilizes a number of
> >>> existing Apache projects (Avro, Hadoop, Hive, ORC, Parquet, Pig, &
> >>> Spark), and we expect that list to expand as the community grows and
> >>> diversifies. Any Apache project in the big data space that needs to
> >>> store or process tabular data would be potentially relevant.
> >>> An Excessive Fascination with the Apache Brand
> >>>
> >>> We are applying to the Incubator process because we think it is the
> >>> next logical step for the Iceberg project after open-sourcing the
> >>> code. This proposal is not for the purpose of generating publicity.
> >>> Rather, we want to make sure to create a very inclusive and
> >>> meritocratic community, outside the umbrella of a single company.
> >>> Netflix has a long history of contributing to Apache projects and
> >>> the Iceberg developers and contributors understand the implications
> >>> of making it an Apache project.
> >>> Required Resources
> >>> Mailing lists
> >>>
> >>>    - d...@iceberg.incubator.apache.org
> >>>    - comm...@iceberg.incubator.apache.org
> >>>    - priv...@iceberg.incubator.apache.org
> >>>
> >>> The podling may also create a user mailing list, if needed.
> >>> Source Control and Issue Tracking
> >>>
> >>> The Iceberg podling would use Apache’s gitbox integration to sync
> >>> between GitHub and Apache infrastructure. The podling would use
> >>> GitHub issues and pull requests for community engagement.
> >>> Current Resources
> >>>
> >>>    - Initial source: https://github.com/Netflix/iceberg
> >>>    - Java documentation:
> >>>    https://netflix.github.io/iceberg/current/javadoc/index.html?com/netflix/iceberg/package-summary.html
> >>>    - Table specification:
> >>>    https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit
> >>>
> >>> Source and Intellectual Property Submission Plan
> >>>
> >>> The Iceberg source code in GitHub is currently licensed under Apache
> >>> License v2.0 and the copyright is assigned to Netflix. If Iceberg
> >>> becomes an Incubator project at the ASF, Netflix will transfer the
> >>> source code and trademark ownership to the Apache Software
> >>> Foundation via a Software Grant Agreement.
> >>> External Dependencies
> >>>
> >>> External dependencies licensed under Apache License 2.0
> >>>
> >>>    - Guava https://github.com/google/guava
> >>>    - Jackson https://github.com/FasterXML/jackson-core
> >>>    - Joda-Time http://www.joda.org/joda-time/
> >>>
> >>> External dependencies licensed under the MIT License
> >>>
> >>>    - SLF4J https://www.slf4j.org/
> >>>    - Mockito https://github.com/mockito/mockito
> >>>
> >>> ASF Projects
> >>>
> >>>    - Apache Avro
> >>>    - Apache Hadoop
> >>>    - Apache Hive
> >>>    - Apache ORC
> >>>    - Apache Parquet
> >>>    - Apache Pig
> >>>    - Apache Spark
> >>>
> >>> Cryptography
> >>>
> >>> We do not expect Iceberg to be a controlled export item due to the
> >>> use of encryption.
> >>> Initial Committers and Affiliations
> >>>
> >>>    - Ryan Blue b...@apache.org (Netflix)
> >>>    - Parth Brahmbhatt pa...@apache.org (Netflix)
> >>>    - Julien Le Dem jul...@apache.org (WeWork)
> >>>    - Owen O’Malley omal...@apache.org (Hortonworks)
> >>>    - Daniel Weeks dwe...@apache.org (Netflix)
> >>>
> >>> Sponsors and Nominated Mentors
> >>>
> >>>    - Champion and mentor: Owen O’Malley omal...@apache.org
> >>>    - Mentor: Ryan Blue b...@apache.org
> >>>    - Mentor: Julien Le Dem jul...@apache.org
> >>>
> >>> Sponsoring Entity
> >>>
> >>> The Apache Incubator
> >>> --
> >>> Ryan Blue
> >>>
> >>
> >>
> >> --
> >> Ryan Blue
> >>
> >
> >
> > --
> > Matt Sicker <boa...@gmail.com>