+1 ( binding ) Interesting project. Please add me as a mentor to the project.
On Tue, Sep 8, 2020 at 3:26 PM Matt Casters <matt.cast...@neotechnology.com.invalid> wrote: > Hello Apache, > > Our community is eager to propose for Hop to join the Apache Incubator. > The Hop Orchestration Platform aims to help people with complex data and > metadata orchestration problems. > > Below is the complete text of the proposal but you can also find it here: > https://cwiki.apache.org/confluence/display/INCUBATOR/HopProposal > > Any help with respect to the incubation is appreciated including help from > a few more mentors to set us on the right track. On behalf of my community > I'd be happy to answer any questions you might have regarding Hop. Our > thanks go out to Max, Julian and Tom for helping us set up this proposal. > > Thanks in advance for your time! > > Best regards, > > Matt - Hop co-founder > www.project-hop.org > --- > > Abstract > ========= > Hop is short for the Hop Orchestration Platform. Written completely in Java > it aims to provide a wide range of data orchestration tools, including a > visual development environment, servers, metadata analysis, auditing > services and so on. As a platform Hop also wants to be a re-usable library > so that it can be easily re-used by other software. > > Proposal > ========= > Hop provides all the tools to build, maintain and deploy data > orchestration, ETL and data integration solutions. For example, Hop allows > you to diagram a data flow that propagates changes from a database via > Apache Kafka to a data warehouse and deploy it as an Apache Beam pipeline. > The core concepts of Hop are Pipelines and Workflows. > * Pipelines do the core data manipulation work (read, manipulate, write > data). The main items of work in pipelines are transforms. A pipeline > consists of two or more (usually many) transforms that each perform a > granular piece of work. The transforms in a pipeline run in parallel, and > together create a powerful data processing tool. 
> * Workflows take care of the orchestration of actions: execute pipelines, > run child workflows, environment checks, preparation, problem alerting and > so on. > If these terms sound familiar, it’s because they are taken from the Apache > Beam and Apache Airflow projects. > > > The main components of the Hop platform are: > * hop-gui: a visual data orchestration IDE > * hop-run: a CLI tool to run workflows or pipelines > * hop-config: a CLI tool to configure Hop and its components > * hop-server: a light-weight web server to run and monitor workflows and > pipelines > * hop-translator: a tool for translating the various parts of the Hop tools > (i18n). > * hop-web: a thin client version of hop-gui for web browsers and mobile > devices > > > The cornerstone of the Hop platform is extensibility: all major components > of the platform are designed to be pluggable. This allows any possible > missing functionality to be created in a short amount of time. > > Background > =========== > The Hop Orchestration Platform has its origins in the Kettle community. > Kettle was acquired by Pentaho and, after Pentaho’s acquisition by Hitachi > in 2015, the community struck out to solve problems less aligned with > Hitachi’s interests. > > Rationale > ========== > In the Hop community, we have always aimed to function as a meritocracy, > where contributions are accepted based on merit, and individuals gain > status in the community based on their contributions (coding and > otherwise). We’re proud to have a diverse group of people doing all the > required things in a project: development, documentation, tutorials, > architecture, testing, graphic design and much more. Bringing the project > under the Apache Software Foundation would allow us to continue and grow, > but also give our users confidence about the governance, IP status, and > future of the project. 
> > ASF Preparation Phase > ====================== > The very first goal of project Hop is to find a good way to cooperate on > the development across wide geographical, economic and social spectra. To > make this possible, real changes were needed to a codebase which is > essentially 20 years old. Most of these changes have been tackled by now. > We think it’s fair to say that by now, Hop is a new platform even though it > shares a common background as it partly started from the Kettle code base. > Here are a few of the key focus areas we’re trying to safeguard going > forward: > * Plugins: lightweight plugins for all major functionality. This makes it > possible to extend Hop or reduce Hop in size. It also allows people to > implement or change functionality with minimal coding. In other words it > makes it easier to contribute. > * Maintain an open and responsive community where every concern, feedback > and contribution is welcome. > * Maintain a clear focus on data orchestration user requirements, not on > “industry trends” > * Documentation: we set up a version controlled “adoc” system with > automated builds which is open, controlled and reviewed. This is > incredibly important for every Hop user and developer. > * Testing and stability: we want to massively increase stability by > implementing integration tests beyond the standard Java unit testing > because of the dynamic nature of data orchestration work. We still have a > long way to go. This work will never be finished. It’s a clear and > important goal nevertheless. > * Simplicity: things are complex enough. We follow the example of projects > like Apache Spark and Flink, so for example “hop-run.sh” does exactly > what the name says without the need to dive into documentation. As much as > possible we make things self-evident and will re-use existing terminology. > > > For a list of the changes you can look at the monthly roundup which has > been compiled since February 2020. 
It documents the hard work of our community > so far: > > > http://www.project-hop.org/news/roundup-2020-02/ > http://www.project-hop.org/news/roundup-2020-03/ > http://www.project-hop.org/news/roundup-2020-04/ > http://www.project-hop.org/news/roundup-2020-05/ > http://www.project-hop.org/news/roundup-2020-06/ > http://www.project-hop.org/news/roundup-2020-08/ > > > Goals > ====== > Here are a few more details and specifics of things we still want to take > on going forward: > * Add more plugin metadata to Transforms and Action plugins as well as > their supported engines. This will make it easier to refine the user > interface and make the user experience better by giving to-the-point > feedback on what operations are supported and required. Example metadata > to add: extra version and build information, dependencies, tags and labels > (replacing categories), documentation links, input and output capabilities, > engine capabilities and so on. > * SWT: While the Eclipse SWT project is still supported we want to make a > list of all the commonly used API calls and stick to those with our own > API. This will help the development of hop-web and possibly allow us to > migrate more easily to different user interfaces later on. > * Integration testing: every transform and action should have an > integration test before it is released to ensure quality. Java unit > testing has proven to be insufficient to guard against regressions in > backward compatibility, stability and functionality. We need to do better. > * Apache VFS: Hop makes extensive use of this API to handle files. As such > we want to implement the various drivers for gs://, hdfs://, s3:// through > standard Kettle plugins making it easier to choose which protocols to > support. > * Variables & Parameters: make this experience more intuitive, clean up > the underlying API and add more options to the various user interfaces > responsible for setting and passing variables and parameters. 
> * Make Hop-Web an integral part of the Apache Hop project removing the code > duplication (fork) we’re dealing with now. This includes the need to > improve various user interfaces which were designed for non-web clients. > * Make best practices and governance functionality an integral part of the > API of the project: > * Data sets and unit testing (already done) > * Environments and lifecycle management (partly done) > * Git support (partly done) > * Auditing and lineage > * Software policies and enforcement thereof > * Configuration management (partly done) > > > Current Status > =============== > > Meritocracy > ------------ > With Project Hop, we actively work to foster the existing community and > encourage community contributions. As of September 1st 2020 we have received > over 250 pull requests, have around 600 tickets in our JIRA platform (a > lot of which were created by community members) and have active discussions > in our Mattermost chat platform with over 80 members. > > > Over the last half year we started to ask users on our chat server for > specific feedback on terminology, features and so on. It’s been a > wonderfully positive experience to have in-depth discussions on complex > issues with industry experts. We look forward to moving these discussions > and votes to an Apache mailing list. > > Community > ------------ > Hop is developed, extended and maintained by a global community of users > and developers. The Hop community is what has driven its development and > growth. > The particular past history of Hop has led to a lot of interest in the > project and has already led to a number of contributions, documentation and > translations. > > Core Developers > ---------------- > We have a diverse group of core developers with people joining on a regular > basis. Matt Casters, Rodrigo Haces and David Rosenblum are part time > developers on Hop, salaried by Neo Solutions. 
Bart Maertens, Hans Van > Akelyen, Yannick Mols are part time Hop developers paid by > know.bi. Doug and Gretchen Moran were Pentaho employees; along with > Rafael Valenzuela, Dan Keeley, Jason Chu, Sergio Ramazzina and many others, > they have been consultants and community members > for over a decade and joined the Hop community in the last year or two. > > > Alignment > ---------- > We want to anchor and safeguard our development and community building > efforts for the future. We strongly believe that as an Apache project this > can be achieved in the best possible way. The Hop project also started to > align with projects like Apache Beam, Spark and Flink in its use of > terminology, tools, manner of configuration and so on. As mentioned > elsewhere in this document Hop is a large user of other Apache projects and > libraries and we believe that becoming an Apache project is beneficial. > Specifically for Apache Beam we believe that providing a visual pipeline > development tool can be of great value. > > Known Risks > ============ > While the Kettle code base from which we started is > already released under the Apache License 2.0, proper attribution > to Hitachi Vantara needs to happen. > We have no knowledge of existing patents on any part of the Kettle > codebase. > To further reduce the risk of any discussion on naming, the > Hop team decided to rename the project, its tools (to be more self-evident > as well), the Java API and even the main concepts (Transformations are now > called Pipelines, in line with Apache Beam naming conventions). > > Orphaned products > ------------------ > There is little risk that the project will become orphaned. The list of > active developers is large, and consists of a mix of developers who have > been working on the code for several years and recent arrivals in the > community. 
> > Inexperience with Open Source > ------------------------------ > The project team has a long history in open source and has contributed to > Apache licensed open source projects, mostly in the Kettle ecosystem such > as Kettle itself and the many plugins and projects surrounding it. The > experience gained there has allowed us to quickly set up all required build > tools and processes. In its fairly short history, Hop has been advocating > open source in all aspects of the project. Our submission to the Apache > Software Foundation is a logical extension of our commitment to open source > software. > > Licensing > ---------- > The original source code we started from (see below) has been open source > since December 2005, initially under the Lesser GPL but since January 2012 > all under the Apache License version 2.0. All Hop code has been scanned for > compliance with the Apache License 2.0. We integrated Apache Rat with our build process. > > Heterogeneous Developers > ------------------------- > Hop is built, developed and maintained by a global community of > developers. Input comes from a large group of developers and users from > all over the world. At this moment over 7 companies contribute to Hop > through the developers along with a list of individuals and consultants. > > Reliance on Salaried Developers > -------------------------------- > Hop developers are a mix of volunteers, enthusiasts and people working for > an employer. There is also a group of consultants who want to be involved > in Hop because it allows them to do projects with it. They are in fact our > most important users and developers since they provide valuable feedback > from the trenches. > > Relationships with Other Apache Products > ----------------------------------------- > Hop is a heavy user of Apache software libraries. 
> > Apache Commons usage: > * commons-beanutils > * commons-cli > * commons-codec > * commons-collections > * commons-collections4 > * commons-compiler > * commons-compress > * commons-configuration > * commons-database-model > * commons-dbcp > * commons-digester > * commons-el > * commons-httpclient > * commons-io > * commons-lang and commons-lang3 > * commons-logging > * commons-math and commons-math3-3.5.jar > * commons-net > * commons-pool > * commons-validator > * commons-vfs2 > > > Other libraries: > * Apache Batik : for the front-end SVG drawing > * Apache Xerces (XSLT, XML processing) > > > Other usage of Apache projects related to Hop (plugins): > * Apache Avro > * Apache Beam w/ Apache Spark, Apache Flink, … > * Apache Cassandra > * Apache CouchDB > * Apache Derby > * Apache Flume > * Apache Hadoop > * Apache Hive > * Apache Kafka > * Apache Solr > * Apache Subversion > * Apache Zookeeper > > > For the build process > * Apache Maven > * Jenkins > > An Excessive Fascination with the Apache Brand > ----------------------------------------------- > With this proposal we are not seeking attention or publicity. Rather, we > firmly believe in Hop, visual data pipeline development and the ability to > treat the developed data pipelines (ETL) as software code. While the > original Hop code has been open source for about 15 years, we believe > putting code on GitHub can only go so far. We see the Apache community, > processes, and mission as critical for ensuring Hop is truly > community-driven, positively impactful, and innovative open source > software. We believe Hop is a great fit for the Apache Software Foundation > due to its focus on visual data processing and its relationships to > existing ASF projects. > > Documentation > ============== > Over the years, the community has contributed extensive documentation to > wiki.pentaho.com. Over time, areas of the available information have > become > incomplete or outdated. 
Most of this documentation has been reviewed and > updated, and will be contributed to the Apache Software Foundation with the Hop > source code. Documentation for the extensive new functionality that was > added to Hop in recent months is being written. > We consider documentation to be a core piece of the Hop platform and will > treat documentation as any other item of code. > > Initial Source > =============== > While there isn’t a Java class in Hop which is unchanged from its origins > we should mention we selected this source code to form the base of Apache > Kettle: > https://github.com/pentaho/pentaho-kettle/tree/8.2.0.7-R > > We merged various changes from the WebSpoon fork found over here: > https://github.com/HiromuHota/pentaho-kettle > > > Various community driven Kettle plugins were written to bypass bugs, slow > down code-rot and to implement missing features. They were merged > into Hop from these locations: > https://github.com/mattcasters/kettle-debug-plugin (better debugging) > https://github.com/mattcasters/kettle-beam (Apache Beam support) > https://github.com/mattcasters/pentaho-pdi-dataset (Unit Testing) > https://github.com/mattcasters/kettle-needful-things (Bug fixes & > workarounds) > https://github.com/mattcasters/kettle-environment (Environment management) > > > The Hop repositories are currently hosted at: > https://github.com/project-hop/ > * Hop: source code for the Hop project > * Hop-doc: technical documentation for the Hop project > * Hop-website: Hop website and content repository > * Hop-docker: Docker containers, Kubernetes > > Source and Intellectual Property Submission Plan > ================================================= > The originating source code is already licensed under an Apache 2 license: > * https://github.com/pentaho/pentaho-kettle/blob/8.2.0.7-R/LICENSE.txt > * > https://github.com/HiromuHota/pentaho-kettle/blob/webspoon-8.3/LICENSE.txt > * https://github.com/mattcasters/kettle-debug-plugin/blob/master/LICENSE > * 
https://github.com/mattcasters/kettle-beam/blob/master/LICENSE > * > https://github.com/mattcasters/pentaho-pdi-dataset/blob/master/LICENSE.txt > * https://github.com/mattcasters/kettle-needful-things/blob/master/LICENSE > * https://github.com/mattcasters/kettle-environment/blob/master/LICENSE > > > For all contributions we have an agreement in place: > https://cla-assistant.io/project-hop/hop > > External Dependencies > ====================== > Over the course of the last year we removed non-essential dependencies as > much as possible and replaced them with interfaces and plugin types. We did > this to simplify the architecture. > It’s important to note all external dependencies are licensed under an > Apache 2.0 or Apache-compatible license. As we grow the Hop community we > will configure our build process to validate that all contributions > and dependencies are licensed under the Apache 2.0 license or are under an > Apache-compatible license. > > Cryptography > ============= > > Required Resources > =================== > > Mailing lists > -------------- > We currently use a mix of email and Mattermost. We will migrate our > existing mailing lists to the following: > > d...@hop.incubator.apache.org > u...@hop.incubator.apache.org > priv...@hop.incubator.apache.org > comm...@hop.incubator.apache.org > > Git Repository > --------------- > The Hop code is currently in git and we’d like to keep it that way. We request > a git repository for incubator-hop with mirroring to GitHub. > > Issue Tracking > --------------- > We request the creation of an Apache-hosted JIRA. > > Jira ID: HOP > > > Other Resources > ---------------- > To allow other projects to use Hop as a library we would love to publish > artifacts on a Maven server like maven.apache.org. 
> > Initial Committers > =================== > * Nicholas Adment <nadm...@gmail.com> > * Hans Van Akelyen <hans.van.akel...@know.bi> > * Lokke Bruyndonckx <lokke.bruyndon...@know.bi> > * Matt Casters <matt.cast...@neo4j.com> > * Jason Chu <jianjun...@gmail.com> > * Peter Fabricius <i...@peter-fabricius.de> > * Rodrigo Haces <rodrigo.ha...@neo4j.com> > * Dave Henry <dshenr...@gmail.com> > * Hiromu Hota <hiromu.h...@gmail.com> > * Brandon Jackson <usbran...@gmail.com> > * Dan Keeley <d...@dankeeley.co.uk> > * Bart Maertens <bart.maert...@know.bi> > * Yannick Mols <yannick.m...@know.bi> > * Doug Moran <d...@dougandgretchen.com> > * Gretchen Moran <gretc...@dougandgretchen.com> > * Sergio Ramazzina <sergio.ramazz...@serasoft.it> > * Maria Carina Roldan <maria.carina.rol...@gmail.com> > * David Rosenblum <david.rosenb...@neo4j.com> > * Rafael Valenzuela <rav...@gmail.com> > > Affiliations > ============= > * Neo4J > * Matt Casters > * Rodrigo Haces > * David Rosenblum > * Know.bi > * Bart Maertens > * Hans Van Akelyen > * Lokke Bruyndonckx > * Yannick Mols > * eHealth Africa > * Doug & Gretchen Moran > * Schemetrica > * Dave Henry > * Beijing Auphi Data Co > * Jason Chu > * Serasoft Italy > * Sergio Ramazzina > * Hitachi Research > * Hiromu Hota > > > Sponsors > ========= > Champion > --------- > Maximilian Michels (m...@apache.org) > > Nominated Mentors > ------------------ > Tom Barber (magicaltr...@apache.org) > Julian Hyde (jh...@apache.org) > Maximilian Michels (m...@apache.org) > > Sponsoring Entity > ================== > The Apache Incubator >