Re: [discuss] Apache Gobblin Incubator Proposal

Jim Jagielski Wed, 15 Feb 2017 07:38:16 -0800

If you need/want another mentor, I volunteer

> On Feb 14, 2017, at 3:53 PM, Olivier Lamy <ol...@apache.org> wrote:
> 
> Hi
> Well, I don't see any issues, as no one has discussed the proposal.
> So I will start the official vote tomorrow.
> Cheers
> Olivier
> 
> On 6 February 2017 at 14:08, Olivier Lamy <ol...@apache.org> wrote:
> 
>> Hello everyone,
>> I would like to submit to you a proposal to bring Gobblin to the Apache
>> Software Foundation.
>> The text of the proposal is included below and available as a draft here
>> in the Wiki: https://wiki.apache.org/incubator/GobblinProposal
>> 
>> We will appreciate any feedback and input.
>> 
>> Olivier on behalf of the Gobblin community
>> 
>> 
>> = Apache Gobblin Proposal =
>> == Abstract ==
>> Gobblin is a distributed data integration framework that simplifies common
>> aspects of big data integration such as data ingestion, replication,
>> organization and lifecycle management for both streaming and batch data
>> ecosystems.
>> 
>> == Proposal ==
>> 
>> Gobblin is a universal data integration framework. The framework has been
>> used to build a variety of big data applications such as ingestion,
>> replication, and data retention. The fundamental constructs provided by the
>> Gobblin framework are:
>> 
>> 1. An expandable set of connectors that allow data to be integrated from
>> a variety of sources and sinks. The range of connectors already available
>> in Gobblin is quite diverse and ever expanding. To highlight
>> just a few examples, connectors exist for databases (e.g., MySQL, Oracle,
>> Teradata, Couchbase, etc.), web-based technologies (REST APIs, FTP/SFTP
>> servers, filers), scalable storage (HDFS, S3, Ambry, etc.), streaming data
>> (Kafka, EventHubs, etc.), and a variety of proprietary data sources and
>> sinks (e.g., Salesforce, Google Analytics, Google Webmaster, etc.). Similarly,
>> Gobblin has a rich library of converters that allow for conversion of data
>> from one format to another as data moves across system boundaries (e.g.,
>> Avro in HDFS to JSON in another system).
>> 
>> 
>> 2. Gobblin has a well defined and customizable state management layer
>> that allows writing stateful applications. These are particularly useful
>> when solving problems like bulk incremental ingest and keeping several
>> clusters replicated in sync. The ability to record work that has been
>> completed and what remains in a scalable manner is critical to writing such
>> diverse applications successfully.
>> 
>> 
>> 3. Gobblin is agnostic to the underlying execution engine. It can be
>> tailored to run on top of a variety of execution frameworks ranging from
>> multiple processes on a single node, to open source execution engines like
>> MapReduce, Spark or Samza, natively on top of raw containers like Yarn or
>> Mesos, and the public cloud like Amazon AWS or Microsoft Azure. We are
>> extending Gobblin to run on top of a self managed cluster when security is
>> vital. This allows different applications that require different degrees
>> of scalability, latency or security to be customized for their specific
>> needs. For example, highly latency sensitive applications can be executed
>> in a streaming environment while batch based execution might benefit
>> applications where the priority might be geared towards optimal container
>> utilization.
>> 
>> 4. Gobblin comes out of the box with several diagnosability features like
>> Gobblin metrics and error handling. Collectively, these features allow
>> Gobblin to operate at the scale of petabytes of data. To give just one
>> example, the ability to quarantine a few bad records from an isolated Kafka
>> topic without stopping the entire flow from continued execution is vital
>> when the number of Kafka topics ranges in the thousands and the collective
>> data handled is in the petabytes.
>> 
>> Gobblin thus provides crisply defined software constructs that can be used
>> to build a vast array of data integration applications customizable for
>> varied user needs. It has become a preferred technology for data
>> integration use-cases by many organizations worldwide (see a partial list
>> here).
>> 
>> == Background ==
>> 
>> Over the last decade, data integration has evolved use case by use case in
>> most companies. For example, at LinkedIn, when Kafka became a significant
>> part of the data ecosystem, a system called Camus was built to ingest this
>> data for analytics processing on Hadoop. Similarly, we had custom pipelines
>> to ingest data from Salesforce, Oracle and myriad other sources. This
>> pattern became the norm rather than the exception and, at one point, LinkedIn
>> was running at least fifteen different types of ingestion pipelines. This
>> fragmentation has several unfortunate implications. Operational costs scale
>> with the number of pipelines even if the myriad pipelines share a vast
>> array of common features. Bug fixes and performance optimizations cannot be
>> shared across the pipelines. A common set of practices around debugging and
>> deployment does not emerge. Each pipeline operator will continue to invest
>> in his little silo of the data integration world completely oblivious to
>> the challenges of his fellow operator sitting five tables down.
>> 
>> These experiences were the genesis behind the design and implementation of
>> Gobblin. Gobblin thus started out as a universal data ingestion framework
>> focussed on extracting, transforming, and synchronizing large volumes of
>> data between different data sources and sinks. Not surprisingly, given its
>> origins, the initial design of Gobblin placed great emphasis on
>> abstractions that can be leveraged repeatedly. These abstractions have
>> stood the test of time at LinkedIn and we have been able to leverage the
>> constructs well beyond ingest. Gobblin's architecture has allowed us at
>> LinkedIn to use it for a variety of applications ranging from optimal
>> format conversion to adhering to compliance policies set by European
>> standards. Finally, as noted earlier, Gobblin can be deployed in a variety
>> of execution environments: it can be deployed as a library embedded in
>> another application or can be used to execute jobs on a public cloud. A
>> fluid architectural and execution design story has allowed Gobblin to
>> become a truly successful data integration platform.
>> 
>> Gobblin has continued to evolve with a variety of utility packages like
>> Gobblin metrics and Gobblin config management. Collectively, these allow
>> organizations utilizing Gobblin to use a system that has been battle tested
>> at LinkedIn scale. This is something that its consumers have come to
>> appreciate greatly.
>> 
>> == Rationale ==
>> 
>> Gobblin's entry to the Apache foundation is beneficial to both the Gobblin
>> and the Apache communities. Gobblin has greatly benefited from its open
>> source roots. Its community and adoption have grown greatly as a result.
>> More importantly, the feedback from the community, whether through
>> interactions at meetups or through the mailing list, has allowed for a rich
>> exchange of ideas. In order to grow the Gobblin community and improve
>> the project, we would like to propose Gobblin to the Apache incubator. The
>> Gobblin community will greatly benefit from the established development and
>> consensus processes that have worked well for other projects. The Apache
>> process has served many other open source projects well and we believe that
>> the Gobblin community will greatly benefit from these practices as well.
>> 
>> == Initial Goals ==
>> 
>> * Migrate the existing codebase to Apache
>> * Study and integrate with the Apache development process
>> * Ensure all dependencies are compliant with Apache License version 2.0
>> * Incremental development and releases per Apache guidelines
>> * Improve the relationship between Gobblin and other Apache projects
>> 
>> == Current Status ==
>> 
>> Gobblin has undergone five major releases (0.5, 0.6, 0.7, 0.8, 0.9) and
>> many minor ones. The latest version, Gobblin 0.9, was released in
>> December 2016. Gobblin is being used in production by over 20
>> organizations. The Gobblin codebase is currently hosted at github.com, which
>> will seed the Apache git repository.
>> 
>> === Meritocracy ===
>> 
>> We plan to invest in supporting a meritocracy. We will discuss the
>> requirements in an open forum. Several companies have already expressed
>> interest in this project, and we intend to invite additional developers to
>> participate. We will encourage and monitor community participation so that
>> privileges can be extended to those that contribute.
>> 
>> === Community ===
>> 
>> The need for an extensible and flexible data integration platform in
>> open source is tremendous. Gobblin is currently being used by at least 20
>> organizations worldwide (some examples are listed here). By bringing
>> Gobblin into Apache, we believe that the community will grow even bigger.
>> 
>> === Core Developers ===
>> 
>> Gobblin was started by engineers at LinkedIn, and now has developers from
>> Google, Facebook, LinkedIn, Cloudera, Nerdwallet, Swisscom, and many other
>> companies.
>> 
>> === Alignment ===
>> 
>> Gobblin aligns exceedingly well with the Apache ecosystem. Gobblin is
>> built leveraging several existing Apache projects (Apache Helix, Yarn,
>> Zookeeper etc.). As Gobblin matures, we expect to leverage several other
>> Apache projects further. This leverage invariably results in contributions
>> back to these projects (e.g., a contribution to Helix was made during the
>> Gobblin Yarn development). Finally, being an integration platform, it
>> serves as a bridge between several Apache projects like Apache Hadoop and
>> Apache Kafka. This integration is highly desired and their interaction
>> through Gobblin will lead to a virtuous cycle of greater adoption and newer
>> features in these projects. Thus, we believe that it will be a nice
>> addition to the current set of big data projects under the auspices of the
>> Apache foundation.
>> 
>> == Known Risks ==
>> 
>> === Orphaned Products ===
>> 
>> The risk of the Gobblin project being abandoned is minimal. As noted
>> earlier, there are many organizations that have already invested in Gobblin
>> significantly and are thus incentivized to continue development. Many of
>> these organizations operate critical data ingest, compliance and retention
>> pipelines built with Gobblin and are thus heavily invested in the continued
>> success of Gobblin.
>> 
>> === Inexperience with Open Source ===
>> 
>> Gobblin has existed as a healthy open source project for several years.
>> During that time, we have curated an open-source community successfully.
>> Any risks that we foresee are ones associated with scaling our open source
>> communication and operation process rather than with inherent inexperience
>> in operating an open source project.
>> 
>> === Homogenous Developers ===
>> 
>> Gobblin’s committers are employed by companies of varying sizes and
>> industries. Committers come from well-heeled internet companies like Google,
>> LinkedIn and Facebook. We also have developers from traditional enterprise
>> companies like SwissCom. Well funded startups like Nerdwallet are active in
>> the community of developers. We plan to double our efforts in cultivating
>> a diverse set of committers for Gobblin.
>> 
>> === Reliance on Salaried Developers ===
>> 
>> It is expected that Gobblin development will occur on both salaried time
>> and on volunteer time, after hours. The majority of initial committers are
>> paid by their employer to contribute to this project. However, they are all
>> passionate about the project, and we are confident that the project will
>> continue even if no salaried developers contribute to the project. We are
>> committed to recruiting additional committers including non-salaried
>> developers.
>> 
>> === Relationships with Other Apache Products ===
>> 
>> As noted earlier, Gobblin leverages several open source projects and
>> contributes back to them. There is also overlap with aspects of other
>> Apache projects that we will discuss briefly here. Apache Nifi, like
>> Gobblin, aspires to reduce the operational overhead arising from data
>> heterogeneity. Apache Nifi is structured as a visual flow based approach
>> and provides built-in constructs for buffering data, prioritizing data, and
>> understanding data lineage as data flows across systems. Apache Nifi has
>> its own dataflow based execution engine with buffering, scheduling and
>> streaming capabilities. Apache Falcon is a Hadoop centric data governance
>> engine for defining, scheduling, and monitoring data management policies
>> through flow definition typically for data that has been ingested into
>> Hadoop already. Apache Falcon generally delegates data management jobs to
>> tools that already exist in the Hadoop ecosystem (e.g. Distcp, Sqoop, Hive
>> etc). Apache Sqoop is primarily geared for bulk ingest especially from
>> databases which is one part of Gobblin’s feature set. Apache Flume focuses
>> primarily on streaming data movement. Finally, general purpose data
>> processing engines like Apache Flink, Apache Samza, and Apache Spark focus
>> on generic computation.
>> 
>> Gobblin's design choices intersect with specific features in all of these
>> systems; however, in aggregate, it is a different point in the design space.
>> It is designed to handle both streaming and batch data. It supports
>> execution through a standalone cluster mode as well as through existing
>> frameworks such as MR, Yarn, Hive, Samza, etc., allowing users to choose the
>> deployment model that is optimal for the specific data integration
>> challenge. It provides native optimized implementations for critical
>> integrations such as Kafka and Hadoop-to-Hadoop copies. Gobblin also
>> supports both Hadoop and non-Hadoop data, being able to ingest data into
>> Kafka as well as other key-value stores like Couchbase. Gobblin is also not
>> just a generic computation framework; it has specific constructs for data
>> integration patterns such as data quality metrics and policies. Gobblin’s
>> configuration management system allows it to be fully multi-tenant and take
>> advantage of grouped policies when required. For batch workloads, Gobblin
>> has a planning phase that provides for better resource utilization.
>> 
>> In summary, there is healthy diversity in the number of systems
>> approaching the interesting and pressing problem of big data integration.
>> We believe that Gobblin will provide another compelling choice in that
>> design space.
>> 
>> === An Excessive Fascination with the Apache Brand ===
>> 
>> Gobblin is already a healthy and well known open source project. This
>> proposal is not for the purpose of generating publicity. Rather, the
>> primary benefits to joining Apache are already outlined in the Rationale
>> section.
>> 
>> == Documentation ==
>> 
>> The reader will find these websites highly relevant:
>> * Website: http://linkedin.github.io/gobblin/
>> * Documentation: https://gobblin.readthedocs.io/en/latest/
>> * Codebase: https://github.com/linkedin/gobblin/
>> * User group: https://groups.google.com/forum/#!forum/gobblin-users
>> 
>> == Source and Intellectual Property Submission Plan ==
>> 
>> The Gobblin codebase is currently hosted on Github. This is the exact
>> codebase that we would migrate to the Apache foundation. The Gobblin source
>> code is already licensed under Apache License Version 2.0. Going forward,
>> we will continue to have all the contributions licensed directly to the
>> Apache foundation through our signed Individual Contributor License
>> Agreements for all the committers on the project.
>> 
>> == External Dependencies ==
>> 
>> To the best of our knowledge, all of Gobblin's dependencies are distributed
>> under Apache compatible licenses. Upon acceptance to the incubator, we
>> would begin a thorough analysis of all transitive dependencies to verify
>> this fact and introduce license checking into the build and release process
>> (for instance integrating Apache Rat).
>> 
>> == Cryptography ==
>> 
>> We do not expect Gobblin to be a controlled export item due to the use of
>> encryption.
>> 
>> == Required Resources ==
>> 
>> === Mailing lists ===
>> 
>> * gobblin-user
>> * gobblin-dev
>> * gobblin-commits
>> * gobblin-private for private PMC discussions (with moderated
>> subscriptions)
>> 
>> === Subversion Directory ===
>> 
>> Git is the preferred source control system: git://git.apache.org/gobblin
>> 
>> === Issue Tracking ===
>> 
>> JIRA Gobblin (GOBBLIN)
>> 
>> === Other Resources ===
>> 
>> The existing code already has unit and integration tests, so we would
>> like a Jenkins instance to run them whenever a new patch is submitted. This
>> can be added after project creation.
>> 
>> == Initial Committers ==
>> 
>> * Abhishek Tiwari <abhishektiwari dot btech at gmail dot com>
>> * Shirshanka Das <shirshanka at apache dot org>
>> * Chavdar Botev <cbotev at gmail dot com>
>> * Sahil Takiar <takiar.sahil at gmail dot com>
>> * Yinan Li <liyinan926 at gmail dot com>
>> * Ziyang Liu <>
>> * Lorand Bendig <lbendig at gmail dot com>
>> * Issac Buenrostro <ibuenros at linkedin dot com>
>> * Hung Tran <hutran at linkedin dot com>
>> * Olivier Lamy <olamy at apache dot org>
>> * Jean-Baptiste Onofré <jbono...@apache.org>
>> 
>> == Affiliations ==
>> 
>> * Abhishek Tiwari - LinkedIn
>> * Shirshanka Das - LinkedIn
>> * Chavdar Botev - Stealth Startup
>> * Sahil Takiar - Cloudera
>> * Yinan Li - Google
>> * Ziyang Liu - Facebook
>> * Lorand Bendig - Swisscom
>> * Issac Buenrostro - LinkedIn
>> * Hung Tran - LinkedIn
>> * Olivier Lamy - Webtide
>> * Jean-Baptiste Onofre - Talend
>> 
>> == Sponsors ==
>> 
>> === Champion ===
>> 
>> Olivier Lamy <olamy at apache dot org>
>> 
>> === Nominated Mentors ===
>> 
>> * Olivier Lamy <olamy at apache dot org>
>> * Jean-Baptiste Onofre <jbonofre at apache dot org>
>> * ?
>> * ?
>> 
>> == Sponsoring Entity ==
>> The Apache Incubator
>> 
> 
> 
> 
> -- 
> Olivier Lamy
> http://twitter.com/olamy | http://linkedin.com/in/olamy


