Re: [discuss] Apache Gobblin Incubator Proposal

Olivier Lamy Tue, 14 Feb 2017 12:54:29 -0800

Hi
Well, I don't see any issues, as no one has discussed the proposal.
So I will start the official vote tomorrow.
Cheers
Olivier


On 6 February 2017 at 14:08, Olivier Lamy <ol...@apache.org> wrote:

> Hello everyone,
> I would like to submit to you a proposal to bring Gobblin to the Apache
> Software Foundation.
> The text of the proposal is included below and available as a draft here
> in the Wiki: https://wiki.apache.org/incubator/GobblinProposal
>
> We will appreciate any feedback and input.
>
> Olivier on behalf of the Gobblin community
>
>
> = Apache Gobblin Proposal =
> == Abstract ==
> Gobblin is a distributed data integration framework that simplifies common
> aspects of big data integration such as data ingestion, replication,
> organization and lifecycle management for both streaming and batch data
> ecosystems.
>
> == Proposal ==
>
> Gobblin is a universal data integration framework. The framework has been
> used to build a variety of big data applications such as ingestion,
> replication, and data retention. The fundamental constructs provided by the
> Gobblin framework are:
>
>  1. An expandable set of connectors that allow data to be integrated from
> a variety of sources and sinks. The range of connectors already available
> in Gobblin is quite diverse and ever expanding. To highlight
> just a few examples, connectors exist for databases (e.g., MySQL, Oracle,
> Teradata, Couchbase etc.), web based technologies (REST APIs, FTP/SFTP
> servers, Filers), scalable storage (HDFS, S3, Ambry etc.), streaming data
> (Kafka, EventHubs etc.), and a variety of proprietary data sources and
> sinks (e.g. Salesforce, Google Analytics, Google Webmaster etc.). Similarly,
> Gobblin has a rich library of converters that allow for conversion of data
> from one format to another as data moves across system boundaries (e.g.
> Avro in HDFS to JSON in another system).
>
>
>  2. Gobblin has a well defined and customizable state management layer
> that allows writing stateful applications. These are particularly useful
> when solving problems like bulk incremental ingest and keeping several
> clusters replicated in sync. The ability to record work that has been
> completed and what remains in a scalable manner is critical to writing such
> diverse applications successfully.
>
>
>  3. Gobblin is agnostic to the underlying execution engine. It can be
> tailored to run on top of a variety of execution frameworks ranging from
> multiple processes on a single node, to open source execution engines like
> MapReduce, Spark or Samza, natively on top of raw containers like Yarn or
> Mesos, and the public cloud like Amazon AWS or Microsoft Azure. We are
> extending Gobblin to run on top of a self managed cluster when security is
> vital. This allows different applications that require different degrees
> of scalability, latency or security to be customized for their specific
> needs. For example, highly latency sensitive applications can be executed
> in a streaming environment while batch based execution might benefit
> applications where the priority might be geared towards optimal container
> utilization.
>
>  4. Gobblin comes out of the box with several diagnosability features like
> Gobblin metrics and error handling. Collectively, these features allow
> Gobblin to operate at the scale of petabytes of data. To give just one
> example, the ability to quarantine a few bad records from an isolated Kafka
> topic without stopping the entire flow from continued execution is vital
> when the number of Kafka topics ranges in the thousands and the collective
> data handled is in the petabytes.
>
> Gobblin thus provides crisply defined software constructs that can be used
> to build a vast array of data integration applications customizable for
> varied user needs. It has become a preferred technology for data
> integration use-cases by many organizations worldwide (see a partial list
> here).
>
> == Background ==
>
> Over the last decade, data integration has evolved use case by use case in
> most companies. For example, at LinkedIn, when Kafka became a significant
> part of the data ecosystem, a system called Camus was built to ingest this
> data for analytics processing on Hadoop. Similarly, we had custom pipelines
> to ingest data from Salesforce, Oracle and myriad other sources. This
> pattern became the norm rather than the exception and at one point, LinkedIn
> was running at least fifteen different types of ingestion pipelines. This
> fragmentation has several unfortunate implications. Operational costs scale
> with the number of pipelines even if the myriad pipelines share a vast
> array of common features. Bug fixes and performance optimizations cannot be
> shared across the pipelines. A common set of practices around debugging and
> deployment does not emerge. Each pipeline operator will continue to invest
> in his little silo of the data integration world completely oblivious to
> the challenges of his fellow operator sitting five tables down.
>
> These experiences were the genesis behind the design and implementation of
> Gobblin. Gobblin thus started out as a universal data ingestion framework
> focussed on extracting, transforming, and synchronizing large volumes of
> data between different data sources and sinks. Not surprisingly, given its
> origins, the initial design of Gobblin placed great emphasis on
> abstractions that can be leveraged repeatedly. These abstractions have
> stood the test of time at LinkedIn and we have been able to leverage the
> constructs well beyond ingest. Gobblin's architecture has allowed us at
> LinkedIn to use it for a variety of applications ranging from optimal
> format conversion to adhering to compliance policies set by European
> standards. Finally, as noted earlier, Gobblin can be deployed in a variety
> of execution environments: it can be deployed as a library embedded in
> another application or can be used to execute jobs on a public cloud. A
> fluid architectural and execution design story has allowed Gobblin to
> become a truly successful data integration platform.
>
> Gobblin has continued to evolve with a variety of utility packages like
> Gobblin metrics and Gobblin config management. Collectively, these allow
> organizations utilizing Gobblin to use a system that has been battle tested
> at LinkedIn scale. This is something that its consumers have come to
> appreciate greatly.
>
> == Rationale ==
>
> Gobblin's entry to the Apache foundation is beneficial to both the Gobblin
> and the Apache communities. Gobblin has greatly benefited from its open
> source roots. Its community and adoption have grown greatly as a result.
> More importantly, the feedback from the community, whether through
> interactions at meetups or through the mailing list, has allowed for a rich
> exchange of ideas. In order to grow the Gobblin community and improve
> the project, we would like to propose Gobblin to the Apache incubator. The
> Apache development and consensus processes have served many other open
> source projects well, and we believe that the Gobblin community will
> greatly benefit from these practices as well.
>
> == Initial Goals ==
>
>  * Migrate the existing codebase to Apache
>  * Study and integrate with the Apache development process
>  * Ensure all dependencies are compliant with Apache License version 2.0
>  * Incremental development and releases per Apache guidelines
>  * Improve the relationship between Gobblin and other Apache projects
>
> == Current Status ==
>
> Gobblin has undergone five major releases (0.5, 0.6, 0.7, 0.8, 0.9) and
> many minor ones. The latest version, Gobblin 0.9, was released in
> December 2016. Gobblin is being used in production by over 20
> organizations. The Gobblin codebase is currently hosted at github.com, which
> will seed the Apache git repository.
>
> === Meritocracy ===
>
> We plan to invest in supporting a meritocracy. We will discuss the
> requirements in an open forum. Several companies have already expressed
> interest in this project, and we intend to invite additional developers to
> participate. We will encourage and monitor community participation so that
> privileges can be extended to those that contribute.
>
> === Community ===
>
> The need for an extensible and flexible data integration platform in
> open source is tremendous. Gobblin is currently being used by at least 20
> organizations worldwide (some examples are listed here). By bringing
> Gobblin into Apache, we believe that the community will grow even bigger.
>
> === Core Developers ===
>
> Gobblin was started by engineers at LinkedIn, and now has developers from
> Google, Facebook, LinkedIn, Cloudera, Nerdwallet, Swisscom, and many other
> companies.
>
> === Alignment ===
>
> Gobblin aligns exceedingly well with the Apache ecosystem. Gobblin is
> built leveraging several existing Apache projects (Apache Helix, Yarn,
> Zookeeper etc.). As Gobblin matures, we expect to leverage several other
> Apache projects further. This leverage invariably results in contributions
> back to these projects (e.g., a contribution to Helix was made during the
> Gobblin Yarn development). Finally, being an integration platform, it
> serves as a bridge between several Apache projects like Apache Hadoop and
> Apache Kafka. This integration is highly desired and their interaction
> through Gobblin will lead to a virtuous cycle of greater adoption and newer
> features in these projects. Thus, we believe that it will be a nice
> addition to the current set of big data projects under the auspices of the
> Apache foundation.
>
> == Known Risks ==
>
> === Orphaned Products ===
>
> The risk of the Gobblin project being abandoned is minimal. As noted
> earlier, there are many organizations that have already invested in Gobblin
> significantly and are thus incentivized to continue development. Many of
> these organizations operate critical data ingest, compliance and retention
> pipelines built with Gobblin and are thus heavily invested in the continued
> success of Gobblin.
>
> === Inexperience with Open Source ===
>
> Gobblin has existed as a healthy open source project for several years.
> During that time, we have curated an open-source community successfully.
> Any risks that we foresee are ones associated with scaling our open source
> communication and operation process rather than with inherent inexperience
> in operating an open source project.
>
> === Homogenous Developers ===
>
> Gobblin’s committers are employed by companies of varying sizes and
> industries. Committers come from well heeled internet companies like Google,
> LinkedIn and Facebook. We also have developers from traditional enterprise
> companies like Swisscom. Well funded startups like Nerdwallet are active in
> the community of developers. We plan to double our efforts in cultivating
> a diverse set of committers for Gobblin.
>
> === Reliance on Salaried Developers ===
>
> It is expected that Gobblin development will occur on both salaried time
> and on volunteer time, after hours. The majority of initial committers are
> paid by their employer to contribute to this project. However, they are all
> passionate about the project, and we are confident that the project will
> continue even if no salaried developers contribute to the project. We are
> committed to recruiting additional committers including non-salaried
> developers.
>
> === Relationships with Other Apache Products ===
>
> As noted earlier, Gobblin leverages several open source projects and
> contributes back to them. There is also overlap with aspects of other
> Apache projects that we will discuss briefly here. Apache Nifi, like
> Gobblin, aspires to reduce the operational overhead arising from data
> heterogeneity. Apache Nifi is structured as a visual flow based approach
> and provides built-in constructs for buffering data, prioritizing data, and
> understanding data lineage as data flows across systems. Apache Nifi has
> its own dataflow based execution engine with buffering, scheduling and
> streaming capabilities. Apache Falcon is a Hadoop centric data governance
> engine for defining, scheduling, and monitoring data management policies
> through flow definition typically for data that has been ingested into
> Hadoop already. Apache Falcon generally delegates data management jobs to
> tools that already exist in the Hadoop ecosystem (e.g. Distcp, Sqoop, Hive
> etc.). Apache Sqoop is primarily geared for bulk ingest, especially from
> databases, which is one part of Gobblin’s feature set. Apache Flume focuses
> primarily on streaming data movement. Finally, general purpose data
> processing engines like Apache Flink, Apache Samza, and Apache Spark focus
> on generic computation.
>
> Gobblin’s design choices intersect with specific features in all of these
> systems; however, in aggregate, it is a different point in the design space.
> It is designed to handle both streaming and batch data. It supports
> execution through a standalone cluster mode as well as through existing
> frameworks such as MR, Yarn, Hive, Samza etc., allowing users to choose the
> deployment model that is optimal for the specific data integration
> challenge. It provides native optimized implementations for critical
> integrations such as Kafka, Hadoop-to-Hadoop copies etc. Gobblin also
> supports both Hadoop and non-Hadoop data, being able to ingest data into
> Kafka as well as other key-value stores like Couchbase. Gobblin is also not
> just a generic computation framework; it has specific constructs for data
> integration patterns such as data quality metrics and policies. Gobblin’s
> configuration management system allows it to be fully multi-tenant and take
> advantage of grouped policies when required. For batch workloads, Gobblin
> has a planning phase that provides for better resource utilization.
>
> In summary, there is healthy diversity in the number of systems
> approaching the interesting and pressing problem of big data integration.
> We believe that Gobblin will provide another compelling choice in that
> design space.
>
> === An Excessive Fascination with the Apache Brand ===
>
> Gobblin is already a healthy and well known open source project. This
> proposal is not for the purpose of generating publicity. Rather, the
> primary benefits to joining Apache are already outlined in the Rationale
> section.
>
> == Documentation ==
>
> The reader will find these websites highly relevant:
>  * Website: http://linkedin.github.io/gobblin/
>  * Documentation: https://gobblin.readthedocs.io/en/latest/
>  * Codebase: https://github.com/linkedin/gobblin/
>  * User group: https://groups.google.com/forum/#!forum/gobblin-users
>
> == Source and Intellectual Property Submission Plan ==
>
> The Gobblin codebase is currently hosted on Github. This is the exact
> codebase that we would migrate to the Apache foundation. The Gobblin source
> code is already licensed under Apache License Version 2.0. Going forward,
> we will continue to have all the contributions licensed directly to the
> Apache foundation through our signed Individual Contributor License
> Agreements for all the committers on the project.
>
> == External Dependencies ==
>
> To the best of our knowledge, all of Gobblin's dependencies are distributed
> under Apache compatible licenses. Upon acceptance to the incubator, we
> would begin a thorough analysis of all transitive dependencies to verify
> this fact and introduce license checking into the build and release process
> (for instance integrating Apache Rat).
>
> == Cryptography ==
>
> We do not expect Gobblin to be a controlled export item due to the use of
> encryption.
>
> == Required Resources ==
>
> === Mailing lists ===
>
>  * gobblin-user
>  * gobblin-dev
>  * gobblin-commits
>  * gobblin-private for private PMC discussions (with moderated
> subscriptions)
>
> === Subversion Directory ===
>
> Git is the preferred source control system: git://git.apache.org/gobblin
>
> === Issue Tracking ===
>
> JIRA Gobblin (GOBBLIN)
>
> === Other Resources ===
>
>  The existing code already has unit and integration tests, so we would
> like a Jenkins instance to run them whenever a new patch is submitted. This
> can be added after project creation.
>
> == Initial Committers ==
>
>  * Abhishek Tiwari <abhishektiwari dot btech at gmail dot com>
>  * Shirshanka Das <shirshanka at apache dot org>
>  * Chavdar Botev <cbotev at gmail dot com>
>  * Sahil Takiar <takiar.sahil at gmail dot com>
>  * Yinan Li <liyinan926 at gmail dot com>
>  * Ziyang Liu <>
>  * Lorand Bendig <lbendig at gmail dot com>
>  * Issac Buenrostro <ibuenros at linkedin dot com>
>  * Hung Tran <hutran at linkedin dot com>
>  * Olivier Lamy <olamy at apache dot org>
>  * Jean-Baptiste Onofré <jbono...@apache.org>
>
> == Affiliations ==
>
>  * Abhishek Tiwari - LinkedIn
>  * Shirshanka Das - LinkedIn
>  * Chavdar Botev - Stealth Startup
>  * Sahil Takiar - Cloudera
>  * Yinan Li - Google
>  * Ziyang Liu - Facebook
>  * Lorand Bendig - Swisscom
>  * Issac Buenrostro - LinkedIn
>  * Hung Tran - LinkedIn
>  * Olivier Lamy - Webtide
>  * Jean-Baptiste Onofre - Talend
>
> == Sponsors ==
>
> === Champion ===
>
> Olivier Lamy <olamy at apache dot org>
>
> === Nominated Mentors ===
>
>  * Olivier Lamy <olamy at apache dot org>
>  * Jean-Baptiste Onofre <jbonofre at apache dot org>
>  * ?
>  * ?
>
> == Sponsoring Entity ==
> The Apache Incubator
>



-- 
Olivier Lamy
http://twitter.com/olamy | http://linkedin.com/in/olamy
