+1 (non-binding)

> On Nov 13, 2018, at 9:19 AM, Owen O'Malley <owen.omal...@gmail.com> wrote:
>
> +1 (binding)
>
>> On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher <dave2w...@comcast.net> wrote:
>>
>> +1 (binding)
>>
>>> On Nov 13, 2018, at 9:10 AM, Matt Sicker <boa...@gmail.com> wrote:
>>>
>>> +1 binding
>>>
>>>> On Tue, 13 Nov 2018 at 11:09, Ryan Blue <b...@apache.org> wrote:
>>>>
>>>> +1 (binding)
>>>>
>>>>> On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue <b...@apache.org> wrote:
>>>>>
>>>>> The discuss thread seems to have reached consensus, so I propose
>>>>> accepting the Iceberg project for incubation.
>>>>>
>>>>> The proposal is copied below and in the wiki:
>>>>> https://wiki.apache.org/incubator/IcebergProposal
>>>>>
>>>>> Please vote on whether to accept Iceberg in the next 72 hours:
>>>>>
>>>>> [ ] +1, accept Iceberg for incubation
>>>>> [ ] -1, reject the Iceberg proposal because . . .
>>>>>
>>>>> Thank you for reviewing the proposal and voting,
>>>>>
>>>>> rb
>>>>> ------------------------------
>>>>> Iceberg Proposal
>>>>>
>>>>> Abstract
>>>>>
>>>>> Iceberg is a table format for large, slow-moving tabular data.
>>>>>
>>>>> It is designed to improve on the de facto standard table layout
>>>>> built into Apache Hive, Presto, and Apache Spark.
>>>>>
>>>>> Proposal
>>>>>
>>>>> The purpose of Iceberg is to provide SQL-like tables that are backed
>>>>> by large sets of data files. Iceberg is similar to the Hive table
>>>>> layout, the de facto standard structure used to track files in a
>>>>> table, but provides additional guarantees and performance
>>>>> optimizations:
>>>>>
>>>>> - Atomicity - Each change to the table will be complete or will
>>>>>   fail. “Do or do not. There is no try.”
>>>>> - Snapshot isolation - Reads use one and only one snapshot of a
>>>>>   table at some time without holding a lock.
>>>>> - Safe schema evolution - A table’s schema can change in
>>>>>   well-defined ways, without breaking older data files.
>>>>> - Column projection - An engine may request a subset of the
>>>>>   available columns, including nested fields.
>>>>> - Predicate pushdown - An engine can push filters into read planning
>>>>>   to improve performance using partition data and file-level
>>>>>   statistics.
>>>>>
>>>>> Iceberg does NOT define a new file format. All data is stored in
>>>>> Apache Avro, Apache ORC, or Apache Parquet files.
>>>>>
>>>>> Additionally, Iceberg is designed to work well when data files are
>>>>> stored in cloud blob stores, even when those systems provide weaker
>>>>> guarantees than a file system, including:
>>>>>
>>>>> - Eventual consistency in the namespace
>>>>> - High latency for directory listings
>>>>> - No renames of objects
>>>>> - No folder hierarchy
>>>>>
>>>>> Rationale
>>>>>
>>>>> Initial benchmarks show dramatic improvements in query planning. For
>>>>> example, in Netflix’s Atlas use case, which stores time-series
>>>>> metrics from Netflix runtime systems, 1 month of data is stored
>>>>> across 2.7 million files in 2,688 partitions:
>>>>>
>>>>> - Hive table using Parquet:
>>>>>   - 400k+ splits, not combined
>>>>>   - Explain query: 9.6 minutes wall time (planning only)
>>>>> - Iceberg table with partition filtering:
>>>>>   - 15,218 splits, combined
>>>>>   - Planning: 10 seconds
>>>>>   - Query wall time: 13 minutes
>>>>> - Iceberg table with partition and min/max filtering:
>>>>>   - 412 splits
>>>>>   - Planning: 25 seconds
>>>>>   - Query wall time: 42 seconds
>>>>>
>>>>> These performance gains, combined with the cross-engine
>>>>> compatibility, are a very compelling story.
>>>>>
>>>>> Initial Goals
>>>>>
>>>>> The initial goal will be to move the existing codebase to Apache and
>>>>> integrate with the Apache development process and infrastructure. A
>>>>> primary goal of incubation will be to grow and diversify the Iceberg
>>>>> community. We are well aware that the project community is largely
>>>>> composed of individuals from a single company.
>>>>> We aim to change that during incubation.
>>>>>
>>>>> Current Status
>>>>>
>>>>> As previously mentioned, Iceberg is under active development at
>>>>> Netflix, and is being used in processing large volumes of data in
>>>>> Amazon EC2.
>>>>>
>>>>> Iceberg license documentation is already based on Apache guidelines
>>>>> for LICENSE and NOTICE content.
>>>>>
>>>>> Meritocracy
>>>>>
>>>>> We value meritocracy and we understand that it is the basis for an
>>>>> open community that encourages multiple companies and individuals to
>>>>> contribute and be invested in the project’s future. We will
>>>>> encourage and monitor participation and make sure to extend
>>>>> privileges and responsibilities to all contributors.
>>>>>
>>>>> Community
>>>>>
>>>>> Iceberg is currently being used by developers at Netflix, and a
>>>>> growing number of users are actively using it in production
>>>>> environments. Iceberg has received contributions from developers
>>>>> working at Hortonworks, WeWork, and Palantir. By bringing Iceberg to
>>>>> Apache we aim to assure current and future contributors that the
>>>>> Iceberg community is meritocratic and open, in order to broaden and
>>>>> diversify the user and developer community.
>>>>>
>>>>> Core Developers
>>>>>
>>>>> Iceberg was initially developed at Netflix and is under active
>>>>> development. We believe Iceberg will be of interest to a broad range
>>>>> of users and developers and that incubating the project at the ASF
>>>>> will help us build a diverse, sustainable community.
>>>>>
>>>>> Alignment
>>>>>
>>>>> Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive,
>>>>> ORC, Parquet, Pig, and Spark. We anticipate integration with
>>>>> additional Apache projects as the Iceberg community and interest in
>>>>> the project grow.
>>>>>
>>>>> Known Risks
>>>>>
>>>>> Orphaned Products
>>>>>
>>>>> Netflix is committed to the future development of Iceberg and
>>>>> understands that graduation to a TLP, while preferable, is not the
>>>>> only positive outcome of incubation.
>>>>>
>>>>> Should the Iceberg project be accepted by the Incubator, the
>>>>> prospective PPMC would be willing to agree to a target incubation
>>>>> period of 2 years or less, knowing that every Incubator project
>>>>> incurs a certain cost in terms of ASF infrastructure and volunteer
>>>>> time.
>>>>>
>>>>> Inexperience with Open Source
>>>>>
>>>>> Three of the initial committers are Apache members and Incubator PMC
>>>>> members. They will work with the other community members to teach
>>>>> them the Apache Way.
>>>>>
>>>>> Homogenous Developers
>>>>>
>>>>> The majority of the committers work at Netflix, though we are
>>>>> committed to recruiting and developing additional committers from a
>>>>> wide spectrum of industries and backgrounds.
>>>>>
>>>>> Reliance on Salaried Developers
>>>>>
>>>>> It is expected that Iceberg development will occur on both salaried
>>>>> time and on volunteer time, after hours. Most of the initial
>>>>> committers are paid by Netflix to contribute to this project.
>>>>> However, they are all passionate about the project, and we are both
>>>>> confident and hopeful that the project will continue even if no
>>>>> salaried developers contribute to the project.
>>>>>
>>>>> Relationships with Other Apache Products
>>>>>
>>>>> As mentioned in the Alignment section, Iceberg utilizes a number of
>>>>> existing Apache projects (Avro, Hadoop, Hive, ORC, Parquet, Pig, &
>>>>> Spark), and we expect that list to expand as the community grows and
>>>>> diversifies. Any Apache project in the big data space that needs to
>>>>> store or process tabular data would be potentially relevant.
>>>>>
>>>>> An Excessive Fascination with the Apache Brand
>>>>>
>>>>> We are applying to the Incubator process because we think it is the
>>>>> next logical step for the Iceberg project after open-sourcing the
>>>>> code. This proposal is not for the purpose of generating publicity.
>>>>> Rather, we want to make sure to create a very inclusive and
>>>>> meritocratic community, outside the umbrella of a single company.
>>>>> Netflix has a long history of contributing to Apache projects and
>>>>> the Iceberg developers and contributors understand the implications
>>>>> of making it an Apache project.
>>>>>
>>>>> Required Resources
>>>>>
>>>>> Mailing lists
>>>>>
>>>>> - d...@iceberg.incubator.apache.org
>>>>> - comm...@iceberg.incubator.apache.org
>>>>> - priv...@iceberg.incubator.apache.org
>>>>>
>>>>> The podling may also create a user mailing list, if needed.
>>>>>
>>>>> Source Control and Issue Tracking
>>>>>
>>>>> The Iceberg podling would use Apache’s GitBox integration to sync
>>>>> between GitHub and Apache infrastructure. The podling would use
>>>>> GitHub issues and pull requests for community engagement.
>>>>>
>>>>> Current Resources
>>>>>
>>>>> - Initial source: https://github.com/Netflix/iceberg
>>>>> - Java documentation:
>>>>>   https://netflix.github.io/iceberg/current/javadoc/index.html?com/netflix/iceberg/package-summary.html
>>>>> - Table specification:
>>>>>   https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit
>>>>>
>>>>> Source and Intellectual Property Submission Plan
>>>>>
>>>>> The Iceberg source code on GitHub is currently licensed under Apache
>>>>> License v2.0 and the copyright is assigned to Netflix. If Iceberg
>>>>> becomes an Incubator project at the ASF, Netflix will transfer the
>>>>> source code and trademark ownership to the Apache Software
>>>>> Foundation via a Software Grant Agreement.
>>>>>
>>>>> External Dependencies
>>>>>
>>>>> External dependencies licensed under Apache License 2.0
>>>>>
>>>>> - Guava https://github.com/google/guava
>>>>> - Jackson https://github.com/FasterXML/jackson-core
>>>>> - Joda-Time http://www.joda.org/joda-time/
>>>>>
>>>>> External dependencies licensed under the MIT License
>>>>>
>>>>> - SLF4J https://www.slf4j.org/
>>>>> - Mockito https://github.com/mockito/mockito
>>>>>
>>>>> ASF Projects
>>>>>
>>>>> - Apache Avro
>>>>> - Apache Hadoop
>>>>> - Apache Hive
>>>>> - Apache ORC
>>>>> - Apache Parquet
>>>>> - Apache Pig
>>>>> - Apache Spark
>>>>>
>>>>> Cryptography
>>>>>
>>>>> We do not expect Iceberg to be a controlled export item due to the
>>>>> use of encryption.
>>>>>
>>>>> Initial Committers and Affiliations
>>>>>
>>>>> - Ryan Blue b...@apache.org (Netflix)
>>>>> - Parth Brahmbhatt pa...@apache.org (Netflix)
>>>>> - Julien Le Dem jul...@apache.org (WeWork)
>>>>> - Owen O’Malley omal...@apache.org (Hortonworks)
>>>>> - Daniel Weeks dwe...@apache.org (Netflix)
>>>>>
>>>>> Sponsors and Nominated Mentors
>>>>>
>>>>> - Champion and mentor: Owen O’Malley omal...@apache.org
>>>>> - Mentor: Ryan Blue b...@apache.org
>>>>> - Mentor: Julien Le Dem jul...@apache.org
>>>>>
>>>>> Sponsoring Entity
>>>>>
>>>>> The Apache Incubator
>>>>> --
>>>>> Ryan Blue
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>>
>>>
>>>
>>> --
>>> Matt Sicker <boa...@gmail.com>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org
>>
>>