[DISCUSS] Iceberg proposal for incubation

Ryan Blue Mon, 05 Nov 2018 10:04:46 -0800

Hi everyone,

I'd like to start a discussion about moving Netflix's Iceberg project to
the incubator. Iceberg is a library and specification for tracking data
files that store table data in the big data ecosystem. Iceberg is designed
to guarantee snapshot isolation and to address performance problems with
big data tables, especially when using S3 or another object store as the
source of truth.


We've written up an Iceberg proposal
<https://docs.google.com/document/d/1gRkXkIkEsupsBstv6h_6wwNixt3YhWKPSMuxkA-geV4/edit?usp=sharing>
in Google Docs. I can post it to the wiki as well if that's needed, but I
thought it might be easier to read there, and to update after this
discussion.

The initial proposal is below. Thanks for taking a look!

Iceberg Proposal

Abstract

Iceberg is a new table format for storing large, slow-moving tabular data.
It is designed to improve on the de-facto standard table layout built into
Hive, Presto, and Spark.

Proposal

The purpose of Iceberg is to provide SQL-like tables that are backed by
large sets of data files. Iceberg is similar to the Hive table layout, the
de-facto standard structure used to track files in a table, but provides
additional guarantees and performance optimizations:

- Atomicity - Each change to the table will either complete or fail. “Do
  or do not. There is no try.”
- Snapshot isolation - Reads use one and only one snapshot of a table at
  some time without holding a lock.
- Safe schema evolution - A table’s schema can change in well-defined
  ways, without breaking older data files.
- Column projection - An engine may request a subset of the available
  columns, including nested fields.
- Predicate pushdown - An engine can push filters into read planning to
  improve performance using partition data and file-level statistics.

Iceberg does NOT define a new file format. All data is stored in Avro, ORC,
or Parquet files.

Additionally, Iceberg is designed to work well when data files are stored
in cloud blob stores, even when those systems provide weaker guarantees
than a file system, including:

- Eventual consistency in the namespace
- High latency for directory listings
- No renames of objects
- No folder hierarchy

Rationale

Initial benchmarks show dramatic improvements in query planning. For
example, in Netflix’s Atlas use case, which stores time-series metrics from
Netflix runtime systems and where 1 month of data is stored across 2.7
million files in 2,688 partitions:

- Hive table using Parquet:
  - 400k+ splits, not combined
  - Explain query: 9.6 minutes wall time (planning only)
- Iceberg table with partition filtering:
  - 15,218 splits, combined
  - Planning: 10 seconds
  - Query wall time: 13 minutes
- Iceberg table with partition and min/max filtering:
  - 412 splits
  - Planning: 25 seconds
  - Query wall time: 42 seconds

These performance gains combined with the cross-engine compatibility are a
very compelling story.

Initial Goals

The initial goal will be to move the existing codebase to Apache and
integrate with the Apache development process and infrastructure. A primary
goal of incubation will be to grow and diversify the Iceberg community. We
are well aware that the project community is largely comprised of
individuals from a single company. We aim to change that during incubation.

Current Status

As previously mentioned, Iceberg is under active development at Netflix,
and is being used in processing large volumes of data in Amazon EC2.

Meritocracy

We value meritocracy and we understand that it is the basis for an open
community that encourages multiple companies and individuals to contribute
and be invested in the project’s future. We will encourage and monitor
participation and make sure to extend privileges and responsibilities to
all contributors.

Community

Iceberg is currently being used by developers at Netflix and a growing
number of users are actively using it in production environments. Iceberg
has received contributions from developers working at Hortonworks, WeWork,
and Palantir. By bringing Iceberg to Apache we aim to assure current and
future contributors that the Iceberg community is meritocratic and open, in
order to broaden and diversify the user and developer community.

Core Developers

Iceberg was initially developed at Netflix and is under active development.
We believe Iceberg will be of interest to a broad range of users and
developers and that incubating the project at the ASF will help us build a
diverse, sustainable community.

Alignment

Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive, ORC,
Parquet, Pig, and Spark. We anticipate integration with additional Apache
projects as the Iceberg community and interest in the project grow.

Known Risks

Orphaned Products

Netflix is committed to the future development of Iceberg and understands
that graduation to a TLP, while preferable, is not the only positive
outcome of incubation.

Should the Iceberg project be accepted by the Incubator, the prospective
PPMC would be willing to agree to a target incubation period of 2 years or
less, knowing that every Incubator project incurs a certain cost in terms
of ASF infrastructure and volunteer time.

Inexperience with Open Source

Three of the initial committers are Apache members and Incubator PMC
members. They will work with the other community members to teach them the
Apache Way.

Homogenous Developers

The majority of the committers work at Netflix, though we are committed to
recruiting and developing additional committers from a wide spectrum of
industries and backgrounds.

Reliance on Salaried Developers

It is expected that Iceberg development will occur on both salaried time
and on volunteer time, after hours. Most of the initial committers are paid
by Netflix to contribute to this project. However, they are all passionate
about the project, and we are both confident and hopeful that the project
will continue even if no salaried developers contribute to the project.

Relationships with Other Apache Products

As mentioned in the Alignment section, Iceberg utilizes a number of
existing Apache projects (Avro, Hadoop, Hive, ORC, Parquet, Pig, and
Spark), and we expect that list to expand as the community grows and
diversifies. Any Apache project in the big data space that needs to store
or process tabular data would be potentially relevant.

An Excessive Fascination with the Apache Brand

We are applying to the Incubator process because we think it is the next
logical step for the Iceberg project after open-sourcing the code. This
proposal is not for the purpose of generating publicity. Rather, we want to
make sure to create a very inclusive and meritocratic community, outside
the umbrella of a single company. Netflix has a long history of
contributing to Apache projects and the Iceberg developers and contributors
understand the implications of making it an Apache project.

Required Resources

Mailing lists

- d...@iceberg.incubator.apache.org
- comm...@iceberg.incubator.apache.org
- priv...@iceberg.incubator.apache.org

The podling may also create a user mailing list, if needed.

Source Control and Issue Tracking

The Iceberg podling would use Apache’s gitbox integration to sync between
GitHub and Apache infrastructure. The podling would use GitHub issues and
pull requests for community engagement.

Current Resources

- Initial source: github.com/Netflix/iceberg
  <https://github.com/Netflix/iceberg>
- Java documentation
  <https://netflix.github.io/iceberg/current/javadoc/index.html?com/netflix/iceberg/package-summary.html>
- Table specification
  <https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit>

Source and Intellectual Property Submission Plan

The Iceberg source code on GitHub is currently licensed under Apache
License v2.0 and the copyright is assigned to Netflix. If Iceberg becomes
an Incubator project at the ASF, Netflix will transfer the source code and
trademark ownership to the Apache Software Foundation via a Software Grant
Agreement.

External Dependencies

External dependencies licensed under Apache License 2.0

- Guava <https://github.com/google/guava>
- Jackson <https://github.com/FasterXML/jackson-core>
- Joda-Time <http://www.joda.org/joda-time/>

External dependencies licensed under the MIT License

- SLF4J <https://www.slf4j.org/>
- Mockito <https://github.com/mockito/mockito>

ASF Projects

- Apache Avro
- Apache Hadoop
- Apache Hive
- Apache ORC
- Apache Parquet
- Apache Pig
- Apache Spark

Cryptography

Not applicable.

Initial Committers

- Ryan Blue <b...@apache.org>
- Parth Brahmbhatt <pa...@apache.org>
- Julien Le Dem <jul...@apache.org>
- Owen O’Malley <omal...@apache.org>
- Daniel Weeks <dwe...@netflix.com>

Sponsors

- Champion and mentor: Owen O’Malley <omal...@apache.org>
- Mentor: Ryan Blue <b...@apache.org>
- Mentor: Julien Le Dem <jul...@apache.org>

Sponsoring Entity

- The Apache Incubator
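
As an aside, the min/max filtering mentioned under predicate pushdown and
in the benchmark numbers can be sketched roughly as follows. This is only a
minimal illustration and not Iceberg's API: DataFile, plan_files, and the
min_ts/max_ts fields are hypothetical names, and the sketch assumes each
data file records the minimum and maximum value of a single timestamp
column. A file is skipped during planning when its value range cannot
overlap the query's filter range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical per-file metadata: a path plus min/max of one column."""
    path: str
    min_ts: int
    max_ts: int


def plan_files(files, lower, upper):
    """Keep only files that may contain rows with lower <= ts < upper.

    A file is pruned when its [min_ts, max_ts] range cannot overlap the
    requested range, so it never has to be opened or turned into a split.
    """
    return [f for f in files if f.max_ts >= lower and f.min_ts < upper]


# Illustrative metadata only; these files and values are made up.
files = [
    DataFile("metrics/a.parquet", 0, 99),
    DataFile("metrics/b.parquet", 100, 199),
    DataFile("metrics/c.parquet", 200, 299),
]

selected = plan_files(files, 150, 210)
print([f.path for f in selected])
```

With this made-up metadata, the filter 150 <= ts < 210 keeps only b.parquet
and c.parquet. The same idea, applied to file-level statistics kept in
table metadata, is what would let planning drop most files before any
splits are created.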

-- 
Ryan Blue
Software Engineer
Netflix
