+1 - Vinay
On Tue, 15 Jan 2019, 10:56 am Hongtao Gao <hanahm...@gmail.com wrote: > +1 > > Hongtao Gao > > > Thomas Weise <t...@apache.org> 于 2019年1月14日周一 上午6:34写道: > > > Hi all, > > > > Following the discussion of the Hudi proposal in [1], this is a vote > > on accepting Hudi into the Apache Incubator, > > per the ASF policy [2] and voting rules [3]. > > > > A vote for accepting a new Apache Incubator podling is a > > majority vote. Everyone is welcome to vote, only > > Incubator PMC member votes are binding. > > > > This vote will run for at least 72 hours. Please VOTE as > > follows: > > > > [ ] +1 Accept Hudi into the Apache Incubator > > [ ] +0 Abstain > > [ ] -1 Do not accept Hudi into the Apache Incubator because ... > > > > The proposal is included below, but you can also access it on > > the wiki [4]. > > > > Thanks for reviewing and voting, > > Thomas > > > > [1] > > > > > https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E > > > > [2] > > > > > https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor > > > > [3] http://www.apache.org/foundation/voting.html > > > > [4] https://wiki.apache.org/incubator/HudiProposal > > > > > > > > = Hudi Proposal = > > > > == Abstract == > > > > Hudi is a big-data storage library, that provides atomic upserts and > > incremental data streams. > > > > Hudi manages data stored in Apache Hadoop and other API compatible > > distributed file systems/cloud stores. > > > > == Proposal == > > > > Hudi provides the ability to atomically upsert datasets with new values > in > > near-real time, making data available quickly to existing query engines > > like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a > > sequence of changes to a dataset from a given point-in-time to enable > > incremental data pipelines that yield greater efficiency & latency than > > their typical batch counterparts. By carefully managing number of files & > > sizes, Hudi greatly aids both query engines (e.g: always providing > > well-sized files) and underlying storage (e.g: HDFS NameNode memory > > consumption). > > > > Hudi is largely implemented as an Apache Spark library that reads/writes > > data from/to Hadoop compatible filesystem. SQL queries on Hudi datasets > are > > supported via specialized Apache Hadoop input formats, that understand > > Hudi’s storage layout. Currently, Hudi manages datasets using a > combination > > of Apache Parquet & Apache Avro file/serialization formats. > > > > == Background == > > > > Apache Hadoop distributed filesystem (HDFS) & other compatible cloud > > storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as > > longer term analytical storage for thousands of organizations. Typical > > analytical datasets are built by reading data from a source (e.g: > upstream > > databases, messaging buses, or other datasets), transforming the data, > > writing results back to storage, & making it available for analytical > > queries--all of this typically accomplished in batch jobs which operate > in > > a bulk fashion on partitions of datasets. Such a style of processing > > typically incurs large delays in making data available to queries as well > > as lot of complexity in carefully partitioning datasets to guarantee > > latency SLAs. > > > > The need for fresher/faster analytics has increased enormously in the > past > > few years, as evidenced by the popularity of Stream processing systems > like > > Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By > > using updateable state store to incrementally compute & instantly reflect > > new results to queries and using a “tailable” messaging bus to publish > > these results to other downstream jobs, such systems employ a different > > approach to building analytical dataset. Even though this approach yields > > low latency, the amount of data managed in such real-time data-marts is > > typically limited in comparison to the aforementioned longer term storage > > options. As a result, the overall data architecture has become more > complex > > with more moving parts and specialized systems, leading to duplication of > > data and a strain on usability. > > > > Hudi takes a hybrid approach. Instead of moving vast amounts of batch > data > > to streaming systems, we simply add the streaming primitives (upserts & > > incremental consumption) onto existing batch processing technologies. We > > believe that by adding some missing blocks to an existing Hadoop stack, > we > > are able to a provide similar capabilities right on top of Hadoop at a > > reduced cost and with an increased efficiency, greatly simplifying the > > overall architecture in the process. > > > > Hudi was originally developed at Uber (original name “Hoodie”) to address > > such broad inefficiencies in ingest & ETL & ML pipelines across Uber’s > data > > ecosystem that required the upsert & incremental consumption primitives > > supported by Hudi. > > > > == Rationale == > > > > We truly believe the capabilities supported by Hudi would be increasingly > > useful for big-data ecosystems, as data volumes & need for faster data > > continue to increase. A detailed description of target use-cases can be > > found at https://uber.github.io/hudi/use_cases.html. > > > > Given our reliance on so many great Apache projects, we believe that the > > Apache way of open source community driven development will enable us to > > evolve Hudi in collaboration with a diverse set of contributors who can > > bring new ideas into the project. > > > > == Initial Goals == > > > > * Move the existing codebase, website, documentation, and mailing lists > to > > an Apache-hosted infrastructure. > > * Integrate with the Apache development process. > > * Ensure all dependencies are compliant with Apache License version 2.0. > > * Incrementally develop and release per Apache guidelines. > > > > == Current Status == > > > > Hudi is a stable project used in production at Uber since 2016 and was > open > > sourced under the Apache License, Version 2.0 in 2017. At Uber, Hudi > > manages 4000+ tables holding several petabytes, bringing our Hadoop > > warehouse from several hours of data delays to under 30 minutes, over the > > past two years. The source code is currently hosted at github.com ( > > https://github.com/uber/hudi ), which will seed the Apache git > repository. > > > > === Meritocracy === > > > > We are fully committed to open, transparent, & meritocratic interactions > > with our community. In fact, one of the primary motivations for us to > enter > > the incubation process is to be able to rely on Apache best practices > that > > can ensure meritocracy. This will eventually help incorporate the best > > ideas back into the project & enable contributors to continue investing > > their time in the project. Current guidelines ( > > https://uber.github.io/hudi/community.html#becoming-a-committer) have > > already put in place a meritocratic process which we will replace with > > Apache guidelines during incubation. > > > > === Community === > > > > Hudi community is fairly young, since the project was open sourced only > in > > early 2017. Currently, Hudi has committers from Uber & Snowflake. We > have a > > vibrant set of contributors (~46 members in our slack channel) including > > Shopify, DoubleVerify and Vungle & others, who have either submitted > > patches or filed issues with hudi pipelines either in early production or > > testing stages. Our primary goal during the incubation would be to grow > the > > community and groom our existing active contributors into committers. > > > > === Core Developers === > > > > Current core developers work at Uber & Snowflake. We are confident that > > incubation will help us grow a diverse community in a open & > collaborative > > way. > > > > === Alignment === > > > > Hudi is designed as a general purpose analytical storage abstraction that > > integrates with multiple Apache projects: Apache Spark, Apache Hive, > Apache > > Hadoop. It was built using multiple Apache projects, including Apache > > Parquet and Apache Avro, that support near-real time analytics right on > top > > of existing Apache Hadoop data lakes. Our sincere hope is that being a > part > > of the Apache foundation would enable us to drive the future of the > project > > in alignment with the other Apache projects for the benefit of thousands > of > > organizations that already leverage these projects. > > > > == Known Risks == > > > > === Orphaned products === > > > > The risk of abandonment of Hudi is low. It is used in production at Uber > > for petabytes of data and other companies (mentioned in community > section) > > are either evaluating or in the early stage for production use. Uber is > > committed to further development of the project and invest resources > > towards the Apache processes & building the community, during incubation > > period. > > > > === Inexperience with Open Source === > > > > Even though the initial committers are new to the Apache world, some have > > considerable open source experience - Vinoth Chandar (Linkedin voldemort, > > Chromium), Prasanna Rajaperumal (Cloudera experience), Zeeshan Qureshi > > (Chromium) & Balaji Varadarajan (Linkedin Databus). We have been > > successfully managing the current open source community answering > questions > > and taking feedback already. Moreover, we hope to obtain guidance and > > mentorship from current ASF members to help us succeed with the > incubation. > > > > === Length of Incubation === > > > > We expect the project be in incubation for 2 years or less. > > > > === Homogenous Developers === > > > > Currently, the lead developers for Hudi are from Uber. However, we have > an > > active set of early contributors/collaborators from Shopify, DoubleVerify > > and Vungle, that we hope will increase the diversity going forward. Once > > again, a primary motivation for incubation is to facilitate this in the > > Apache way. > > > > === Reliance on Salaried Developers === > > > > Both the current committers & early contributors have several years of > core > > expertise around data systems. Current committers are very passionate > about > > the project and have already invested hundreds of hours towards helping & > > building the community. Thus, even with employer changes, we expect they > > will be able to actively engage in the project either because they will > be > > working in similar areas even with newer employers or out of belief in > the > > project. > > > > === Relationships with Other Apache Products === > > > > To the best of our knowledge, there are no direct competing projects with > > Hudi that offer all of the feature set namely - upserts, incremental > > streams, efficient storage/file management, snapshot isolation/rollbacks > - > > in a coherent way. However, some projects share common goals and > technical > > elements and we will highlight them here. Hive ACID/Kudu both offer > upsert > > capabilities without storage management/incremental streams. The recent > > Iceberg project offers similar snapshot isolation/rollbacks, but not > > upserts or other data plane features. A detailed comparison with their > > trade-offs can be found at https://uber.github.io/hudi/comparison.html. > > > > We are committed to open collaboration with such Apache projects and > > incorporate changes to Hudi or contribute patches to other projects, with > > the goal of making it easier for the community at large, to adopt these > > open source technologies. > > > > === Excessive Fascination with the Apache Brand === > > > > This proposal is not for the purpose of generating publicity. We have > > already been doing talks/meetups independently that have helped us build > > our community. We are drawn towards Apache as a potential way of ensuring > > that our open source community management is successful early on so hudi > > can evolve into a broadly accepted--and used--method of managing data on > > Hadoop. > > > > == Documentation == > > [1] Detailed documentation can be found at https://uber.github.io/hudi/ > > > > == Initial Source == > > > > The codebase is currently hosted on Github: https://github.com/uber/hudi > . > > During incubation, the codebase will be migrated to an Apache > > infrastructure. The source code already has an Apache 2.0 licensed. > > > > == Source and Intellectual Property Submission Plan == > > > > Current code is Apache 2.0 licensed and the copyright is assigned to > Uber. > > If the project enters incubator, Uber will transfer the source code & > > trademark ownership to ASF via a Software Grant Agreement > > > > == External Dependencies == > > > > Non apache dependencies are listed below > > > > * JCommander (1.48) Apache-2.0 > > * Kryo (4.0.0) BSD-2-Clause > > * Kryo (2.21) BSD-3-Clause > > * Jackson-annotations (2.6.4) Apache-2.0 > > * Jackson-annotations (2.6.5) Apache-2.0 > > * jackson-databind (2.6.4) Apache-2.0 > > * jackson-databind (2.6.5) Apache-2.0 > > * Jackson datatype: Guava (2.9.4) Apache-2.0 > > * docker-java (3.1.0-rc-3) Apache-2.0 > > * Guava: Google Core Libraries for Java (20.0) Apache-2.0 > > * bijection-avro (0.9.2) Apache-2.0 > > * com.twitter.common:objectsize (0.0.12) Apache-2.0 > > * Ascii Table (0.2.5) Apache-2.0 > > * config (3.0.0) Apache-2.0 > > * utils (3.0.0) Apache-2.0 > > * kafka-avro-serializer (3.0.0) Apache-2.0 > > * kafka-schema-registry-client (3.0.0) Apache-2.0 > > * Metrics Core (3.1.1) Apache-2.0 > > * Graphite Integration for Metrics (3.1.1) Apache-2.0 > > * Joda-Time (2.9.6) Apache-2.0 > > * JUnit CPL-1.0 > > * Awaitility (3.1.2) Apache-2.0 > > * jersey-connectors-apache (2.17) GPL-2.0-only CDDL-1.0 > > * jersey-container-servlet-core (2.17) GPL-2.0-only CDDL-1.0 > > * jersey-core-server (2.17) GPL-2.0-only CDDL-1.0 > > * htrace-core (3.0.4) Apache-2.0 > > * Mockito (1.10.19) MIT > > * scalatest (3.0.1) Apache-2.0 > > * Spring Shell (1.2.0.RELEASE) Apache-2.0 > > > > All of them are Apache compatible > > > > == Cryptography == > > > > No cryptographic libraries used > > > > == Required Resources == > > > > === Mailing lists === > > > > * priv...@hudi.incubator.apache.org (with moderated subscriptions) > > * d...@hudi.incubator.apache.org > > * comm...@hudi.incubator.apache.org > > * u...@hudi.incubator.apache.org > > > > === Git Repositories === > > > > Git is the preferred source control system: git:// > > git.apache.org/incubator-hudi > > > > === Issue Tracking === > > > > We prefer to use the Apache gitbox integration to sync Github & Apache > > infrastructure, and rely on Github issues & pull requests for community > > engagement. If this is not possible, then we prefer JIRA: Hudi (HUDI) > > > > == Initial Committers == > > > > * Vinoth Chandar (vinoth at uber dot com) (Uber) > > * Nishith Agarwal (nagarwal at uber dot com) (Uber) > > * Balaji Varadarajan (varadarb at uber dot com) (Uber) > > * Prasanna Rajaperumal (prasanna dot raj at gmail dot com) (Snowflake) > > * Zeeshan Qureshi (zeeshan dot qureshi at shopify dot com) (Shopify) > > * Anbu Cheeralan (alunarbeach at gmail dot com) (DoubleVerify) > > > > == Sponsors == > > > > === Champion === > > Julien Le Dem (julien at apache dot org) > > > > === Nominated Mentors === > > > > * Luciano Resende (lresende at apache dot org) > > * Thomas Weise (thw at apache dot org > > * Kishore Gopalakrishna (kishoreg at apache dot org) > > * Suneel Marthi (smarthi at apache dot org) > > > > === Sponsoring Entity === > > > > The Incubator PMC > > >