Thanks for the proposal Jim. Regards JB
On Feb 15, 2017, at 11:37, Jim Jagielski <j...@jagunet.com> wrote:
>If you need/want another mentor, I volunteer
>
>> On Feb 14, 2017, at 3:53 PM, Olivier Lamy <ol...@apache.org> wrote:
>>
>> Hi
>> Well, I don't see any issues, as no one has discussed the proposal.
>> So I will start the official vote tomorrow.
>> Cheers
>> Olivier
>>
>> On 6 February 2017 at 14:08, Olivier Lamy <ol...@apache.org> wrote:
>>
>>> Hello everyone,
>>> I would like to submit to you a proposal to bring Gobblin to the Apache
>>> Software Foundation.
>>> The text of the proposal is included below and available as a draft here
>>> in the Wiki: https://wiki.apache.org/incubator/GobblinProposal
>>>
>>> We will appreciate any feedback and input.
>>>
>>> Olivier on behalf of the Gobblin community
>>>
>>>
>>> = Apache Gobblin Proposal =
>>> == Abstract ==
>>> Gobblin is a distributed data integration framework that simplifies common
>>> aspects of big data integration such as data ingestion, replication,
>>> organization and lifecycle management for both streaming and batch data
>>> ecosystems.
>>>
>>> == Proposal ==
>>>
>>> Gobblin is a universal data integration framework. The framework has been
>>> used to build a variety of big data applications such as ingestion,
>>> replication, and data retention. The fundamental constructs provided by the
>>> Gobblin framework are:
>>>
>>> 1. An expandable set of connectors that allow data to be integrated from
>>> a variety of sources and sinks. The range of connectors already available
>>> in Gobblin is quite diverse and ever expanding. To highlight
>>> just a few examples, connectors exist for databases (e.g., MySQL, Oracle,
>>> Teradata, Couchbase etc.), web based technologies (REST APIs, FTP/SFTP
>>> servers, Filers), scalable storage (HDFS, S3, Ambry etc.), streaming data
>>> (Kafka, EventHubs etc.), and a variety of proprietary data sources and
>>> sinks (e.g. Salesforce, Google Analytics, Google Webmaster etc.).
>>> Similarly, Gobblin has a rich library of converters that allow for
>>> conversion of data from one format to another as data moves across system
>>> boundaries (e.g. AVRO in HDFS to JSON in another system).
>>>
>>>
>>> 2. Gobblin has a well defined and customizable state management layer
>>> that allows writing stateful applications. These are particularly useful
>>> when solving problems like bulk incremental ingest and keeping several
>>> clusters replicated in sync. The ability to record work that has been
>>> completed and what remains in a scalable manner is critical to writing such
>>> diverse applications successfully.
>>>
>>>
>>> 3. Gobblin is agnostic to the underlying execution engine. It can be
>>> tailored to run on top of a variety of execution frameworks ranging from
>>> multiple processes on a single node, to open source execution engines like
>>> MapReduce, Spark or Samza, natively on top of raw containers like Yarn or
>>> Mesos, and the public cloud like Amazon AWS or Microsoft Azure. We are
>>> extending Gobblin to run on top of a self managed cluster when security is
>>> vital. This allows different applications that require different degrees
>>> of scalability, latency or security to be customized for their specific
>>> needs. For example, highly latency sensitive applications can be executed
>>> in a streaming environment while batch based execution might benefit
>>> applications where the priority might be geared towards optimal container
>>> utilization.
>>>
>>> 4. Gobblin comes out of the box with several diagnosability features like
>>> Gobblin metrics and error handling. Collectively, these features allow
>>> Gobblin to operate at the scale of petabytes of data.
>>> To give just one example, the ability to quarantine a few bad records from
>>> an isolated Kafka topic without stopping the entire flow from continued
>>> execution is vital when the number of Kafka topics ranges in the thousands
>>> and the collective data handled is in the petabytes.
>>>
>>> Gobblin thus provides crisply defined software constructs that can be used
>>> to build a vast array of data integration applications customizable for
>>> varied user needs. It has become a preferred technology for data
>>> integration use-cases by many organizations worldwide (see a partial list
>>> here).
>>>
>>> == Background ==
>>>
>>> Over the last decade, data integration has evolved use case by use case in
>>> most companies. For example, at LinkedIn, when Kafka became a significant
>>> part of the data ecosystem, a system called Camus was built to ingest this
>>> data for analytics processing on Hadoop. Similarly, we had custom pipelines
>>> to ingest data from Salesforce, Oracle and myriad other sources. This
>>> pattern became the norm rather than the exception and, at one point, LinkedIn
>>> was running at least fifteen different types of ingestion pipelines. This
>>> fragmentation has several unfortunate implications. Operational costs scale
>>> with the number of pipelines even if the myriad pipelines share a vast
>>> array of common features. Bug fixes and performance optimizations cannot be
>>> shared across the pipelines. A common set of practices around debugging and
>>> deployment does not emerge. Each pipeline operator will continue to invest
>>> in his little silo of the data integration world completely oblivious to
>>> the challenges of his fellow operator sitting five tables down.
>>>
>>> These experiences were the genesis behind the design and implementation of
>>> Gobblin.
>>> Gobblin thus started out as a universal data ingestion framework
>>> focussed on extracting, transforming, and synchronizing large volumes of
>>> data between different data sources and sinks. Not surprisingly, given its
>>> origins, the initial design of Gobblin placed great emphasis on
>>> abstractions that can be leveraged repeatedly. These abstractions have
>>> stood the test of time at LinkedIn and we have been able to leverage the
>>> constructs well beyond ingest. Gobblin's architecture has allowed us at
>>> LinkedIn to use it for a variety of applications ranging from optimal
>>> format conversion to adhering to compliance policies set by European
>>> standards. Finally, as noted earlier, Gobblin can be deployed in a variety
>>> of execution environments: it can be deployed as a library embedded in
>>> another application or can be used to execute jobs on a public cloud. A
>>> fluid architectural and execution design story has allowed Gobblin to
>>> become a truly successful data integration platform.
>>>
>>> Gobblin has continued to evolve with a variety of utility packages like
>>> Gobblin metrics and Gobblin config management. Collectively, these allow
>>> organizations utilizing Gobblin to use a system that has been battle tested
>>> at LinkedIn scale. This is something that its consumers have come to
>>> appreciate greatly.
>>>
>>> == Rationale ==
>>>
>>> Gobblin's entry to the Apache foundation is beneficial to both the Gobblin
>>> and the Apache communities. Gobblin has greatly benefited from its open
>>> source roots. Its community and adoption have grown greatly as a result.
>>> More importantly, the feedback from the community, whether through
>>> interactions at meetups or through the mailing list, has allowed for a rich
>>> exchange of ideas. In order to grow the Gobblin community and improve
>>> the project, we would like to propose Gobblin to the Apache incubator.
>>> The Gobblin community will greatly benefit from the established development
>>> and consensus processes that have worked well for other projects. The Apache
>>> process has served many other open source projects well and we believe that
>>> the Gobblin community will greatly benefit from these practices as well.
>>>
>>> == Initial Goals ==
>>>
>>> * Migrate the existing codebase to Apache
>>> * Study and integrate with the Apache development process
>>> * Ensure all dependencies are compliant with Apache License version 2.0
>>> * Incremental development and releases per Apache guidelines
>>> * Improve the relationship between Gobblin and other Apache projects
>>>
>>> == Current Status ==
>>>
>>> Gobblin has undergone five major releases (0.5, 0.6, 0.7, 0.8, 0.9) and
>>> many minor ones. The latest version, Gobblin 0.9, was released in
>>> December 2016. Gobblin is being used in production by over 20
>>> organizations. The Gobblin codebase is currently hosted at github.com, which
>>> will seed the Apache git repository.
>>>
>>> === Meritocracy ===
>>>
>>> We plan to invest in supporting a meritocracy. We will discuss the
>>> requirements in an open forum. Several companies have already expressed
>>> interest in this project, and we intend to invite additional developers to
>>> participate. We will encourage and monitor community participation so that
>>> privileges can be extended to those that contribute.
>>>
>>> === Community ===
>>>
>>> The need for an extensible and flexible data integration platform in
>>> open source is tremendous. Gobblin is currently being used by at least 20
>>> organizations worldwide (some examples are listed here). By bringing
>>> Gobblin into Apache, we believe that the community will grow even bigger.
>>>
>>> === Core Developers ===
>>>
>>> Gobblin was started by engineers at LinkedIn, and now has developers from
>>> Google, Facebook, LinkedIn, Cloudera, Nerdwallet, Swisscom, and many other
>>> companies.
>>>
>>> === Alignment ===
>>>
>>> Gobblin aligns exceedingly well with the Apache ecosystem. Gobblin is
>>> built leveraging several existing Apache projects (Apache Helix, Yarn,
>>> Zookeeper etc.). As Gobblin matures, we expect to leverage several other
>>> Apache projects further. This leverage invariably results in contributions
>>> back to these projects (e.g., a contribution to Helix was made during the
>>> Gobblin Yarn development). Finally, being an integration platform, it
>>> serves as a bridge between several Apache projects like Apache Hadoop and
>>> Apache Kafka. This integration is highly desired and their interaction
>>> through Gobblin will lead to a virtuous cycle of greater adoption and newer
>>> features in these projects. Thus, we believe that it will be a nice
>>> addition to the current set of big data projects under the auspices of the
>>> Apache foundation.
>>>
>>> == Known Risks ==
>>>
>>> === Orphaned Products ===
>>>
>>> The risk of the Gobblin project being abandoned is minimal. As noted
>>> earlier, there are many organizations that have already invested in Gobblin
>>> significantly and are thus incentivized to continue development. Many of
>>> these organizations operate critical data ingest, compliance and retention
>>> pipelines built with Gobblin and are thus heavily invested in the continued
>>> success of Gobblin.
>>>
>>> === Inexperience with Open Source ===
>>>
>>> Gobblin has existed as a healthy open source project for several years.
>>> During that time, we have curated an open-source community successfully.
>>> Any risks that we foresee are ones associated with scaling our open source
>>> communication and operation process rather than with inherent inexperience
>>> in operating an open source project.
>>>
>>> === Homogenous Developers ===
>>>
>>> Gobblin’s committers are employed by companies of varying sizes and
>>> industries.
>>> Committers come from well heeled internet companies like Google,
>>> LinkedIn and Facebook. We also have developers from traditional enterprise
>>> companies like Swisscom. Well funded startups like Nerdwallet are active in
>>> the community of developers. We plan to double our efforts in cultivating
>>> a diverse set of committers for Gobblin.
>>>
>>> === Reliance on Salaried Developers ===
>>>
>>> It is expected that Gobblin development will occur on both salaried time
>>> and on volunteer time, after hours. The majority of initial committers are
>>> paid by their employer to contribute to this project. However, they are all
>>> passionate about the project, and we are confident that the project will
>>> continue even if no salaried developers contribute to the project. We are
>>> committed to recruiting additional committers including non-salaried
>>> developers.
>>>
>>> === Relationships with Other Apache Products ===
>>>
>>> As noted earlier, Gobblin leverages several open source projects and
>>> contributes back to them. There is also overlap with aspects of other
>>> Apache projects that we will discuss briefly here. Apache Nifi, like
>>> Gobblin, aspires to reduce the operational overhead arising from data
>>> heterogeneity. Apache Nifi is structured as a visual flow based approach
>>> and provides built-in constructs for buffering data, prioritizing data, and
>>> understanding data lineage as data flows across systems. Apache Nifi has
>>> its own dataflow based execution engine with buffering, scheduling and
>>> streaming capabilities. Apache Falcon is a Hadoop centric data governance
>>> engine for defining, scheduling, and monitoring data management policies
>>> through flow definition, typically for data that has been ingested into
>>> Hadoop already. Apache Falcon generally delegates data management jobs to
>>> tools that already exist in the Hadoop ecosystem (e.g. Distcp, Sqoop, Hive
>>> etc.).
>>> Apache Sqoop is primarily geared for bulk ingest, especially from
>>> databases, which is one part of Gobblin’s feature set. Apache Flume focuses
>>> primarily on streaming data movement. Finally, general purpose data
>>> processing engines like Apache Flink, Apache Samza, and Apache Spark focus
>>> on generic computation.
>>>
>>> Gobblin design choices intersect with specific features in all of these
>>> systems; however, in aggregate, it is a different point in the design space.
>>> It is designed to handle both streaming and batch data. It supports
>>> execution through a standalone cluster mode as well as through existing
>>> frameworks such as MR, Yarn, Hive, Samza etc., allowing users to choose the
>>> deployment model that is optimal for the specific data integration
>>> challenge. It provides native optimized implementations for critical
>>> integrations such as Kafka, Hadoop-to-Hadoop copies etc. Gobblin also
>>> supports both Hadoop and non-Hadoop data, being able to ingest data into
>>> Kafka as well as other key-value stores like Couchbase. Gobblin is also not
>>> just a generic computation framework; it has specific constructs for data
>>> integration patterns such as data quality metrics and policies. Gobblin’s
>>> configuration management system allows it to be fully multi-tenant and take
>>> advantage of grouped policies when required. For batch workloads, Gobblin
>>> has a planning phase that provides for better resource utilization.
>>>
>>> In summary, there is healthy diversity in the number of systems
>>> approaching the interesting and pressing problem of big data integration.
>>> We believe that Gobblin will provide another compelling choice in that
>>> design space.
>>>
>>> === An Excessive Fascination with the Apache Brand ===
>>>
>>> Gobblin is already a healthy and well known open source project. This
>>> proposal is not for the purpose of generating publicity.
>>> Rather, the primary benefits to joining Apache are already outlined in the
>>> Rationale section.
>>>
>>> == Documentation ==
>>>
>>> The reader will find these websites highly relevant:
>>> * Website: http://linkedin.github.io/gobblin/
>>> * Documentation: https://gobblin.readthedocs.io/en/latest/
>>> * Codebase: https://github.com/linkedin/gobblin/
>>> * User group: https://groups.google.com/forum/#!forum/gobblin-users
>>>
>>> == Source and Intellectual Property Submission Plan ==
>>>
>>> The Gobblin codebase is currently hosted on Github. This is the exact
>>> codebase that we would migrate to the Apache foundation. The Gobblin source
>>> code is already licensed under Apache License Version 2.0. Going forward,
>>> we will continue to have all the contributions licensed directly to the
>>> Apache foundation through our signed Individual Contributor License
>>> Agreements for all the committers on the project.
>>>
>>> == External Dependencies ==
>>>
>>> To the best of our knowledge, all of Gobblin's dependencies are distributed
>>> under Apache compatible licenses. Upon acceptance to the incubator, we
>>> would begin a thorough analysis of all transitive dependencies to verify
>>> this fact and introduce license checking into the build and release process
>>> (for instance integrating Apache Rat).
>>>
>>> == Cryptography ==
>>>
>>> We do not expect Gobblin to be a controlled export item due to the use of
>>> encryption.
>>>
>>> == Required Resources ==
>>>
>>> === Mailing lists ===
>>>
>>> * gobblin-user
>>> * gobblin-dev
>>> * gobblin-commits
>>> * gobblin-private for private PMC discussions (with moderated
>>> subscriptions)
>>>
>>> === Subversion Directory ===
>>>
>>> Git is the preferred source control system: git://git.apache.org/gobblin
>>>
>>> === Issue Tracking ===
>>>
>>> JIRA Gobblin (GOBBLIN)
>>>
>>> === Other Resources ===
>>>
>>> The existing code already has unit and integration tests, so we would
>>> like a Jenkins instance to run them whenever a new patch is submitted. This
>>> can be added after project creation.
>>>
>>> == Initial Committers ==
>>>
>>> * Abhishek Tiwari <abhishektiwari dot btech at gmail dot com>
>>> * Shirshanka Das <shirshanka at apache dot org>
>>> * Chavdar Botev <cbotev at gmail dot com>
>>> * Sahil Takiar <takiar.sahil at gmail dot com>
>>> * Yinan Li <liyinan926 at gmail dot com>
>>> * Ziyang Liu <>
>>> * Lorand Bendig <lbendig at gmail dot com>
>>> * Issac Buenrostro <ibuenros at linkedin dot com>
>>> * Hung Tran <hutran at linkedin dot com>
>>> * Olivier Lamy <olamy at apache dot org>
>>> * Jean-Baptiste Onofré <jbono...@apache.org>
>>>
>>> == Affiliations ==
>>>
>>> * Abhishek Tiwari - LinkedIn
>>> * Shirshanka Das - LinkedIn
>>> * Chavdar Botev - Stealth Startup
>>> * Sahil Takiar - Cloudera
>>> * Yinan Li - Google
>>> * Ziyang Liu - Facebook
>>> * Lorand Bendig - Swisscom
>>> * Issac Buenrostro - LinkedIn
>>> * Hung Tran - LinkedIn
>>> * Olivier Lamy - Webtide
>>> * Jean-Baptiste Onofre - Talend
>>>
>>> == Sponsors ==
>>>
>>> === Champion ===
>>>
>>> Olivier Lamy <olamy at apache dot org>
>>>
>>> === Nominated Mentors ===
>>>
>>> * Olivier Lamy <olamy at apache dot org>
>>> * Jean-Baptiste Onofre <jbonofre at apache dot org>
>>> * ?
>>> * ?
>>>
>>> == Sponsoring Entity ==
>>> The Apache Incubator
>>>
>>
>>
>>
>> --
>> Olivier Lamy
>> http://twitter.com/olamy | http://linkedin.com/in/olamy
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>For additional commands, e-mail: general-h...@incubator.apache.org