+1 (Non-binding)
Best,
Arthur

On Tue, Nov 13, 2018, 09:24 Hugo Louro <hmclo...@gmail.com> wrote:

> +1 (non-binding)
>
> > On Nov 13, 2018, at 9:19 AM, Owen O'Malley <owen.omal...@gmail.com> wrote:
> >
> > +1 (binding)
> >
> >> On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher <dave2w...@comcast.net> wrote:
> >>
> >> +1 (binding)
> >>
> >>> On Nov 13, 2018, at 9:10 AM, Matt Sicker <boa...@gmail.com> wrote:
> >>>
> >>> +1 binding
> >>>
> >>>> On Tue, 13 Nov 2018 at 11:09, Ryan Blue <b...@apache.org> wrote:
> >>>>
> >>>> +1 (binding)
> >>>>
> >>>>> On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue <b...@apache.org> wrote:
> >>>>>
> >>>>> The discuss thread seems to have reached consensus, so I propose
> >>>>> accepting the Iceberg project for incubation.
> >>>>>
> >>>>> The proposal is copied below and in the wiki:
> >>>>> https://wiki.apache.org/incubator/IcebergProposal
> >>>>>
> >>>>> Please vote on whether to accept Iceberg in the next 72 hours:
> >>>>>
> >>>>> [ ] +1, accept Iceberg for incubation
> >>>>> [ ] -1, reject the Iceberg proposal because . . .
> >>>>>
> >>>>> Thank you for reviewing the proposal and voting,
> >>>>>
> >>>>> rb
> >>>>> ------------------------------
> >>>>> Iceberg Proposal
> >>>>>
> >>>>> Abstract
> >>>>>
> >>>>> Iceberg is a table format for large, slow-moving tabular data.
> >>>>>
> >>>>> It is designed to improve on the de-facto standard table layout
> >>>>> built into Apache Hive, Presto, and Apache Spark.
> >>>>>
> >>>>> Proposal
> >>>>>
> >>>>> The purpose of Iceberg is to provide SQL-like tables that are backed
> >>>>> by large sets of data files. Iceberg is similar to the Hive table
> >>>>> layout, the de-facto standard structure used to track files in a
> >>>>> table, but provides additional guarantees and performance
> >>>>> optimizations:
> >>>>>
> >>>>>    - Atomicity - Each change to the table will be complete or will
> >>>>>      fail. “Do or do not.
> >>>>>      There is no try.”
> >>>>>    - Snapshot isolation - Reads use one and only one snapshot of a
> >>>>>      table at some time without holding a lock.
> >>>>>    - Safe schema evolution - A table’s schema can change in
> >>>>>      well-defined ways, without breaking older data files.
> >>>>>    - Column projection - An engine may request a subset of the
> >>>>>      available columns, including nested fields.
> >>>>>    - Predicate pushdown - An engine can push filters into read
> >>>>>      planning to improve performance using partition data and
> >>>>>      file-level statistics.
> >>>>>
> >>>>> Iceberg does NOT define a new file format. All data is stored in
> >>>>> Apache Avro, Apache ORC, or Apache Parquet files.
> >>>>>
> >>>>> Additionally, Iceberg is designed to work well when data files are
> >>>>> stored in cloud blob stores, even when those systems provide weaker
> >>>>> guarantees than a file system, including:
> >>>>>
> >>>>>    - Eventual consistency in the namespace
> >>>>>    - High latency for directory listings
> >>>>>    - No renames of objects
> >>>>>    - No folder hierarchy
> >>>>>
> >>>>> Rationale
> >>>>>
> >>>>> Initial benchmarks show dramatic improvements in query planning. For
> >>>>> example, Netflix’s Atlas use case stores time-series metrics from
> >>>>> Netflix runtime systems, with 1 month of data stored across 2.7
> >>>>> million files in 2,688 partitions:
> >>>>>
> >>>>>    - Hive table using Parquet:
> >>>>>       - 400k+ splits, not combined
> >>>>>       - Explain query: 9.6 minutes wall time (planning only)
> >>>>>    - Iceberg table with partition filtering:
> >>>>>       - 15,218 splits, combined
> >>>>>       - Planning: 10 seconds
> >>>>>       - Query wall time: 13 minutes
> >>>>>    - Iceberg table with partition and min/max filtering:
> >>>>>       - 412 splits
> >>>>>       - Planning: 25 seconds
> >>>>>       - Query wall time: 42 seconds
> >>>>>
> >>>>> These performance gains combined with the cross-engine compatibility
> >>>>> are a very compelling story.
> >>>>>
> >>>>> Initial Goals
> >>>>>
> >>>>> The initial goal will be to move the existing codebase to Apache and
> >>>>> integrate with the Apache development process and infrastructure. A
> >>>>> primary goal of incubation will be to grow and diversify the Iceberg
> >>>>> community. We are well aware that the project community is largely
> >>>>> comprised of individuals from a single company. We aim to change
> >>>>> that during incubation.
> >>>>>
> >>>>> Current Status
> >>>>>
> >>>>> As previously mentioned, Iceberg is under active development at
> >>>>> Netflix, and is being used in processing large volumes of data in
> >>>>> Amazon EC2.
> >>>>>
> >>>>> Iceberg license documentation is already based on Apache guidelines
> >>>>> for LICENSE and NOTICE content.
> >>>>>
> >>>>> Meritocracy
> >>>>>
> >>>>> We value meritocracy and we understand that it is the basis for an
> >>>>> open community that encourages multiple companies and individuals to
> >>>>> contribute and be invested in the project’s future. We will
> >>>>> encourage and monitor participation and make sure to extend
> >>>>> privileges and responsibilities to all contributors.
> >>>>>
> >>>>> Community
> >>>>>
> >>>>> Iceberg is currently being used by developers at Netflix and a
> >>>>> growing number of users are actively using it in production
> >>>>> environments. Iceberg has received contributions from developers
> >>>>> working at Hortonworks, WeWork, and Palantir. By bringing Iceberg to
> >>>>> Apache we aim to assure current and future contributors that the
> >>>>> Iceberg community is meritocratic and open, in order to broaden and
> >>>>> diversify the user and developer community.
> >>>>>
> >>>>> Core Developers
> >>>>>
> >>>>> Iceberg was initially developed at Netflix and is under active
> >>>>> development.
> >>>>> We believe Iceberg will be of interest to a broad range of users
> >>>>> and developers and that incubating the project at the ASF will help
> >>>>> us build a diverse, sustainable community.
> >>>>>
> >>>>> Alignment
> >>>>>
> >>>>> Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive,
> >>>>> ORC, Parquet, Pig, and Spark. We anticipate integration with
> >>>>> additional Apache projects as the Iceberg community and interest in
> >>>>> the project grows.
> >>>>>
> >>>>> Known Risks
> >>>>>
> >>>>> Orphaned Products
> >>>>>
> >>>>> Netflix is committed to the future development of Iceberg and
> >>>>> understands that graduation to a TLP, while preferable, is not the
> >>>>> only positive outcome of incubation.
> >>>>>
> >>>>> Should the Iceberg project be accepted by the Incubator, the
> >>>>> prospective PPMC would be willing to agree to a target incubation
> >>>>> period of 2 years or less, knowing that every Incubator project
> >>>>> incurs a certain cost in terms of ASF infrastructure and volunteer
> >>>>> time.
> >>>>>
> >>>>> Inexperience with Open Source
> >>>>>
> >>>>> Three of the initial committers are Apache members and Incubator PMC
> >>>>> members. They will work with the other community members to teach
> >>>>> them the Apache Way.
> >>>>>
> >>>>> Homogeneous Developers
> >>>>>
> >>>>> The majority of the committers work at Netflix, though we are
> >>>>> committed to recruiting and developing additional committers from a
> >>>>> wide spectrum of industries and backgrounds.
> >>>>>
> >>>>> Reliance on Salaried Developers
> >>>>>
> >>>>> It is expected that Iceberg development will occur on both salaried
> >>>>> time and on volunteer time, after hours. Most of the initial
> >>>>> committers are paid by Netflix to contribute to this project.
> >>>>> However, they are all passionate about the project, and we are both
> >>>>> confident and hopeful that the project will continue even if no
> >>>>> salaried developers contribute to the project.
> >>>>>
> >>>>> Relationships with Other Apache Products
> >>>>>
> >>>>> As mentioned in the Alignment section, Iceberg utilizes a number of
> >>>>> existing Apache projects (Avro, Hadoop, Hive, ORC, Parquet, Pig, &
> >>>>> Spark), and we expect that list to expand as the community grows and
> >>>>> diversifies. Any Apache project in the big data space that needs to
> >>>>> store or process tabular data would be potentially relevant.
> >>>>>
> >>>>> An Excessive Fascination with the Apache Brand
> >>>>>
> >>>>> We are applying to the Incubator process because we think it is the
> >>>>> next logical step for the Iceberg project after open-sourcing the
> >>>>> code. This proposal is not for the purpose of generating publicity.
> >>>>> Rather, we want to make sure to create a very inclusive and
> >>>>> meritocratic community, outside the umbrella of a single company.
> >>>>> Netflix has a long history of contributing to Apache projects and
> >>>>> the Iceberg developers and contributors understand the implication
> >>>>> of making it an Apache project.
> >>>>>
> >>>>> Required Resources
> >>>>>
> >>>>> Mailing lists
> >>>>>
> >>>>>    - d...@iceberg.incubator.apache.org
> >>>>>    - comm...@iceberg.incubator.apache.org
> >>>>>    - priv...@iceberg.incubator.apache.org
> >>>>>
> >>>>> The podling may also create a user mailing list, if needed.
> >>>>>
> >>>>> Source Control and Issue Tracking
> >>>>>
> >>>>> The Iceberg podling would use Apache’s gitbox integration to sync
> >>>>> between GitHub and Apache infrastructure. The podling would use
> >>>>> GitHub issues and pull requests for community engagement.
> >>>>>
> >>>>> Current Resources
> >>>>>
> >>>>>    - Initial source: https://github.com/Netflix/iceberg
> >>>>>    - Java documentation:
> >>>>>      https://netflix.github.io/iceberg/current/javadoc/index.html?com/netflix/iceberg/package-summary.html
> >>>>>    - Table specification:
> >>>>>      https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit
> >>>>>
> >>>>> Source and Intellectual Property Submission Plan
> >>>>>
> >>>>> The Iceberg source code in GitHub is currently licensed under Apache
> >>>>> License v2.0 and the copyright is assigned to Netflix. If Iceberg
> >>>>> becomes an Incubator project at the ASF, Netflix will transfer the
> >>>>> source code and trademark ownership to the Apache Software
> >>>>> Foundation via a Software Grant Agreement.
> >>>>>
> >>>>> External Dependencies
> >>>>>
> >>>>> External dependencies licensed under Apache License 2.0
> >>>>>
> >>>>>    - Guava https://github.com/google/guava
> >>>>>    - Jackson https://github.com/FasterXML/jackson-core
> >>>>>    - Joda-Time http://www.joda.org/joda-time/
> >>>>>
> >>>>> External dependencies licensed under the MIT License
> >>>>>
> >>>>>    - SLF4J https://www.slf4j.org/
> >>>>>    - Mockito https://github.com/mockito/mockito
> >>>>>
> >>>>> ASF Projects
> >>>>>
> >>>>>    - Apache Avro
> >>>>>    - Apache Hadoop
> >>>>>    - Apache Hive
> >>>>>    - Apache ORC
> >>>>>    - Apache Parquet
> >>>>>    - Apache Pig
> >>>>>    - Apache Spark
> >>>>>
> >>>>> Cryptography
> >>>>>
> >>>>> We do not expect Iceberg to be a controlled export item due to the
> >>>>> use of encryption.
> >>>>>
> >>>>> Initial Committers and Affiliations
> >>>>>
> >>>>>    - Ryan Blue b...@apache.org (Netflix)
> >>>>>    - Parth Brahmbhatt pa...@apache.org (Netflix)
> >>>>>    - Julien Le Dem jul...@apache.org (WeWork)
> >>>>>    - Owen O’Malley omal...@apache.org (Hortonworks)
> >>>>>    - Daniel Weeks dwe...@apache.org (Netflix)
> >>>>>
> >>>>> Sponsors and Nominated Mentors
> >>>>>
> >>>>>    - Champion and mentor: Owen O’Malley omal...@apache.org
> >>>>>    - Mentor: Ryan Blue b...@apache.org
> >>>>>    - Mentor: Julien Le Dem jul...@apache.org
> >>>>>
> >>>>> Sponsoring Entity
> >>>>>
> >>>>> The Apache Incubator
> >>>>> --
> >>>>> Ryan Blue
> >>>>>
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>>
> >>>
> >>> --
> >>> Matt Sicker <boa...@gmail.com>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> >> For additional commands, e-mail: general-h...@incubator.apache.org