Re: [DISCUSS] Apache Dataflow Incubator Proposal

Alexander Bezzubov Thu, 21 Jan 2016 08:09:56 -0800

Hi,

it's great to see Dataflow becoming part of the Apache ecosystem, thank you
for bringing it in.
I would be happy to get involved and help.


--
Alex

On Thu, Jan 21, 2016 at 8:42 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Perfect: done, you are on the proposal.
>
> Thanks !
> Regards
> JB
>
>
> On 01/21/2016 11:55 AM, chatz wrote:
>
>> Charitha Elvitigala
>>
>> On 21 January 2016 at 16:17, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>> Hi Chatz,
>>>
>>> sure, what name should I use on the proposal, Charitha ?
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 01/21/2016 11:32 AM, chatz wrote:
>>>
>>> Hi Jean,
>>>>
>>>> I’d be interested in contributing as well.
>>>>
>>>> Thanks,
>>>>
>>>> Chatz
>>>>
>>>>
>>>> On 21 January 2016 at 14:22, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> wrote:
>>>>
>>>> Sweet: you are on the proposal ;)
>>>>
>>>>>
>>>>> Thanks !
>>>>> Regards
>>>>> JB
>>>>>
>>>>>
>>>>> On 01/21/2016 08:55 AM, Byung-Gon Chun wrote:
>>>>>
>>>>> This looks very interesting. I'm interested in contributing.
>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>> -Gon
>>>>>>
>>>>>> ---
>>>>>> Byung-Gon Chun
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 21, 2016 at 1:32 AM, James Malone <
>>>>>> jamesmal...@google.com.invalid> wrote:
>>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>>
>>>>>>> Attached to this message is a proposed new project - Apache
>>>>>>> Dataflow, a
>>>>>>> unified programming model for data processing and integration.
>>>>>>>
>>>>>>> The text of the proposal is included below. Additionally, the
>>>>>>> proposal
>>>>>>> is
>>>>>>> in draft form on the wiki where we will make any required changes:
>>>>>>>
>>>>>>> https://wiki.apache.org/incubator/DataflowProposal
>>>>>>>
>>>>>>> We look forward to your feedback and input.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>> ----
>>>>>>>
>>>>>>> = Apache Dataflow =
>>>>>>>
>>>>>>> == Abstract ==
>>>>>>>
>>>>>>> Dataflow is an open source, unified model and set of
>>>>>>> language-specific
>>>>>>> SDKs
>>>>>>> for defining and executing data processing workflows, and also data
>>>>>>> ingestion and integration flows, supporting Enterprise Integration
>>>>>>> Patterns
>>>>>>> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines
>>>>>>> simplify
>>>>>>> the mechanics of large-scale batch and streaming data processing and
>>>>>>> can
>>>>>>> run on a number of runtimes like Apache Flink, Apache Spark, and
>>>>>>> Google
>>>>>>> Cloud Dataflow (a cloud service). Dataflow also brings DSLs in
>>>>>>> different
>>>>>>> languages, allowing users to easily implement their data integration
>>>>>>> processes.
>>>>>>>
>>>>>>> == Proposal ==
>>>>>>>
>>>>>>> Dataflow is a simple, flexible, and powerful system for distributed
>>>>>>> data
>>>>>>> processing at any scale. Dataflow provides a unified programming
>>>>>>> model, a
>>>>>>> software development kit to define and construct data processing
>>>>>>> pipelines,
>>>>>>> and runners to execute Dataflow pipelines in several runtime engines,
>>>>>>> like
>>>>>>> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be
>>>>>>> used
>>>>>>> for a variety of streaming or batch data processing goals including
>>>>>>> ETL,
>>>>>>> stream analysis, and aggregate computation. The underlying
>>>>>>> programming
>>>>>>> model for Dataflow provides MapReduce-like parallelism, combined with
>>>>>>> support for powerful data windowing, and fine-grained correctness
>>>>>>> control.
>>>>>>>
>>>>>>> == Background ==
>>>>>>>
>>>>>>> Dataflow started as a set of Google projects focused on making data
>>>>>>> processing easier, faster, and less costly. The Dataflow model is a
>>>>>>> successor to MapReduce, FlumeJava, and MillWheel inside Google and is
>>>>>>> focused on providing a unified solution for batch and stream
>>>>>>> processing.
>>>>>>> These projects on which Dataflow is based have been published in
>>>>>>> several
>>>>>>> papers made available to the public:
>>>>>>>
>>>>>>> * MapReduce - http://research.google.com/archive/mapreduce.html
>>>>>>>
>>>>>>> * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>>>>>>>
>>>>>>> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>>>>>>>
>>>>>>> * MillWheel - http://research.google.com/pubs/pub41378.html
>>>>>>>
>>>>>>> Dataflow was designed from the start to provide a portable
>>>>>>> programming
>>>>>>> layer. When you define a data processing pipeline with the Dataflow
>>>>>>> model,
>>>>>>> you are creating a job which is capable of being processed by any
>>>>>>> number
>>>>>>> of
>>>>>>> Dataflow processing engines. Several engines have been developed to
>>>>>>> run
>>>>>>> Dataflow pipelines in other open source runtimes, including
>>>>>>> Dataflow
>>>>>>> runners for Apache Flink and Apache Spark. There is also a “direct
>>>>>>> runner”,
>>>>>>> for execution on the developer machine (mainly for dev/debug
>>>>>>> purposes).
>>>>>>> Another runner allows a Dataflow program to run on a managed service,
>>>>>>> Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java
>>>>>>> SDK
>>>>>>> is
>>>>>>> already available on GitHub, and independent from the Google Cloud
>>>>>>> Dataflow
>>>>>>> service. A Python SDK is also in active development.
>>>>>>>
>>>>>>> In this proposal, the Dataflow SDKs, model, and a set of runners will
>>>>>>> be
>>>>>>> submitted as an OSS project under the ASF. The runners which are a
>>>>>>> part
>>>>>>> of
>>>>>>> this proposal include those for Spark (from Cloudera), Flink (from
>>>>>>> data
>>>>>>> Artisans), and local development (from Google); the Google Cloud
>>>>>>> Dataflow
>>>>>>> service runner is not included in this proposal. Further references
>>>>>>> to
>>>>>>> Dataflow will refer to the Dataflow model, SDKs, and runners which
>>>>>>> are
>>>>>>> a
>>>>>>> part of this proposal (Apache Dataflow) only. The initial submission
>>>>>>> will
>>>>>>> contain the already-released Java SDK; Google intends to submit the
>>>>>>> Python
>>>>>>> SDK later in the incubation process. The Google Cloud Dataflow
>>>>>>> service
>>>>>>> will
>>>>>>> continue to be one of many runners for Dataflow, built on Google
>>>>>>> Cloud
>>>>>>> Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will
>>>>>>> develop against the Apache project additions, updates, and changes.
>>>>>>> Google
>>>>>>> Cloud Dataflow will become one user of Apache Dataflow and will
>>>>>>> participate
>>>>>>> in the project openly and publicly.
>>>>>>>
>>>>>>> The Dataflow programming model has been designed with simplicity,
>>>>>>> scalability, and speed as key tenets. In the Dataflow model, you
>>>>>>> only
>>>>>>> need
>>>>>>> to think about four top-level concepts when constructing your data
>>>>>>> processing job:
>>>>>>>
>>>>>>> * Pipelines - The data processing job made of a series of
>>>>>>> computations
>>>>>>> including input, processing, and output
>>>>>>>
>>>>>>> * PCollections - Bounded (or unbounded) datasets which represent the
>>>>>>> input,
>>>>>>> intermediate and output data in pipelines
>>>>>>>
>>>>>>> * PTransforms - A data processing step in a pipeline which takes one
>>>>>>> or more
>>>>>>> PCollections as input and produces one or more PCollections as output
>>>>>>>
>>>>>>> * I/O Sources and Sinks - APIs for reading and writing data which are
>>>>>>> the
>>>>>>> roots and endpoints of the pipeline
>>>>>>>
>>>>>>> == Rationale ==
>>>>>>>
>>>>>>> With Dataflow, Google intended to develop a framework which allowed
>>>>>>> developers to be maximally productive in defining the processing, and
>>>>>>> then
>>>>>>> be able to execute the program at various levels of
>>>>>>> latency/cost/completeness without re-architecting or re-writing it.
>>>>>>> This
>>>>>>> goal was informed by Google’s past experience developing several
>>>>>>> models,
>>>>>>> frameworks, and tools useful for large-scale and distributed data
>>>>>>> processing. While Google has previously published papers describing
>>>>>>> some
>>>>>>> of
>>>>>>> its technologies, Google decided to take a different approach with
>>>>>>> Dataflow. Google open-sourced the SDK and model alongside
>>>>>>> commercialization
>>>>>>> of the idea and ahead of publishing papers on the topic. As a
>>>>>>> result, a
>>>>>>> number of open source runtimes exist for Dataflow, such as the Apache
>>>>>>> Flink
>>>>>>> and Apache Spark runners.
>>>>>>>
>>>>>>> We believe that submitting Dataflow as an Apache project will provide
>>>>>>> an
>>>>>>> immediate, worthwhile, and substantial contribution to the open
>>>>>>> source
>>>>>>> community. As an incubating project, we believe Dataflow will have a
>>>>>>> better
>>>>>>> opportunity to provide a meaningful contribution to OSS and also
>>>>>>> integrate
>>>>>>> with other Apache projects.
>>>>>>>
>>>>>>> In the long term, we believe Dataflow can be a powerful abstraction
>>>>>>> layer
>>>>>>> for data processing. By providing an abstraction layer for data
>>>>>>> pipelines
>>>>>>> and processing, data workflows can be increasingly portable,
>>>>>>> resilient
>>>>>>> to
>>>>>>> breaking changes in tooling, and compatible across many execution
>>>>>>> engines,
>>>>>>> runtimes, and open source projects.
>>>>>>>
>>>>>>> == Initial Goals ==
>>>>>>>
>>>>>>> We are breaking our initial goals into immediate (< 2 months),
>>>>>>> short-term
>>>>>>> (2-4 months), and intermediate-term (> 4 months).
>>>>>>>
>>>>>>> Our immediate goals include the following:
>>>>>>>
>>>>>>> * Plan for reconciling the Dataflow Java SDK and various runners into
>>>>>>> one
>>>>>>> project
>>>>>>>
>>>>>>> * Plan for refactoring the existing Java SDK for better extensibility
>>>>>>> by
>>>>>>> SDK and runner writers
>>>>>>>
>>>>>>> * Validating all dependencies are ASL 2.0 or compatible
>>>>>>>
>>>>>>> * Understanding and adapting to the Apache development process
>>>>>>>
>>>>>>> Our short-term goals include:
>>>>>>>
>>>>>>> * Moving the newly-merged lists and build utilities to Apache
>>>>>>>
>>>>>>> * Start refactoring codebase and move code to Apache Git repo
>>>>>>>
>>>>>>> * Continue development of new features, functions, and fixes in the
>>>>>>> Dataflow Java SDK, and Dataflow runners
>>>>>>>
>>>>>>> * Cleaning up the Dataflow SDK sources and crafting a roadmap and
>>>>>>> plan
>>>>>>> for
>>>>>>> how to include new major ideas, modules, and runtimes
>>>>>>>
>>>>>>> * Establishment of an easy and clear build/test framework for Dataflow
>>>>>>> and
>>>>>>> associated runtimes; creation of testing, rollback, and validation
>>>>>>> policy
>>>>>>>
>>>>>>> * Analysis and design for work needed to make Dataflow a better data
>>>>>>> processing abstraction layer for multiple open source frameworks and
>>>>>>> environments
>>>>>>>
>>>>>>> Finally, we have a number of intermediate-term goals:
>>>>>>>
>>>>>>> * Roadmapping, planning, and execution of integrations with other OSS
>>>>>>> and
>>>>>>> non-OSS projects/products
>>>>>>>
>>>>>>> * Inclusion of an additional SDK for Python, which is under active
>>>>>>> development
>>>>>>>
>>>>>>> == Current Status ==
>>>>>>>
>>>>>>> === Meritocracy ===
>>>>>>>
>>>>>>> Dataflow was initially developed based on ideas from many employees
>>>>>>> within
>>>>>>> Google. As an ASL OSS project on GitHub, the Dataflow SDK has
>>>>>>> received
>>>>>>> contributions from data Artisans, Cloudera Labs, and other individual
>>>>>>> developers. As a project under incubation, we are committed to
>>>>>>> expanding
>>>>>>> our effort to build an environment which supports a meritocracy. We
>>>>>>> are
>>>>>>> focused on engaging the community and other related projects for
>>>>>>> support
>>>>>>> and contributions. Moreover, we are committed to ensuring contributors
>>>>>>> and
>>>>>>> committers to Dataflow come from a broad mix of organizations
>>>>>>> through a
>>>>>>> merit-based decision process during incubation. We believe strongly
>>>>>>> in
>>>>>>> the
>>>>>>> Dataflow model and are committed to growing an inclusive community of
>>>>>>> Dataflow contributors.
>>>>>>>
>>>>>>> === Community ===
>>>>>>>
>>>>>>> The core of the Dataflow Java SDK has been developed by Google for
>>>>>>> use
>>>>>>> with
>>>>>>> Google Cloud Dataflow. Google has active community engagement in the
>>>>>>> SDK
>>>>>>> GitHub repository (
>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK
>>>>>>> ),
>>>>>>> on Stack Overflow (
>>>>>>> http://stackoverflow.com/questions/tagged/google-cloud-dataflow) and
>>>>>>> has
>>>>>>> had contributions from a number of organizations and individuals.
>>>>>>>
>>>>>>> Every day, Cloud Dataflow is actively used by a number of
>>>>>>> organizations
>>>>>>> and
>>>>>>> institutions for batch and stream processing of data. We believe
>>>>>>> acceptance
>>>>>>> will allow us to consolidate existing Dataflow-related work, grow the
>>>>>>> Dataflow community, and deepen connections between Dataflow and other
>>>>>>> open
>>>>>>> source projects.
>>>>>>>
>>>>>>> === Core Developers ===
>>>>>>>
>>>>>>> The core developers for Dataflow and the Dataflow runners are:
>>>>>>>
>>>>>>> * Frances Perry
>>>>>>>
>>>>>>> * Tyler Akidau
>>>>>>>
>>>>>>> * Davor Bonaci
>>>>>>>
>>>>>>> * Luke Cwik
>>>>>>>
>>>>>>> * Ben Chambers
>>>>>>>
>>>>>>> * Kenn Knowles
>>>>>>>
>>>>>>> * Dan Halperin
>>>>>>>
>>>>>>> * Daniel Mills
>>>>>>>
>>>>>>> * Mark Shields
>>>>>>>
>>>>>>> * Craig Chambers
>>>>>>>
>>>>>>> * Maximilian Michels
>>>>>>>
>>>>>>> * Tom White
>>>>>>>
>>>>>>> * Josh Wills
>>>>>>>
>>>>>>> === Alignment ===
>>>>>>>
>>>>>>> The Dataflow SDK can be used to create Dataflow pipelines which can
>>>>>>> be
>>>>>>> executed on Apache Spark or Apache Flink. Dataflow is also related to
>>>>>>> other
>>>>>>> Apache projects, such as Apache Crunch. We plan on expanding
>>>>>>> functionality
>>>>>>> for Dataflow runners, support for additional domain specific
>>>>>>> languages,
>>>>>>> and
>>>>>>> increased portability so Dataflow is a powerful abstraction layer for
>>>>>>> data
>>>>>>> processing.
>>>>>>>
>>>>>>> == Known Risks ==
>>>>>>>
>>>>>>> === Orphaned Products ===
>>>>>>>
>>>>>>> The Dataflow SDK is presently used by several organizations, from
>>>>>>> small
>>>>>>> startups to Fortune 100 companies, to construct production pipelines
>>>>>>> which
>>>>>>> are executed in Google Cloud Dataflow. Google has a long-term
>>>>>>> commitment
>>>>>>> to
>>>>>>> advance the Dataflow SDK; moreover, Dataflow is seeing increasing
>>>>>>> interest,
>>>>>>> development, and adoption from organizations outside of Google.
>>>>>>>
>>>>>>> === Inexperience with Open Source ===
>>>>>>>
>>>>>>> Google believes strongly in open source and the exchange of
>>>>>>> information
>>>>>>> to
>>>>>>> advance new ideas and work. Examples of this commitment are active
>>>>>>> OSS
>>>>>>> projects such as Chromium (https://www.chromium.org) and Kubernetes
>>>>>>> (
>>>>>>> http://kubernetes.io/). With Dataflow, we have tried to be
>>>>>>> increasingly
>>>>>>> open and forward-looking; we have published a paper in the VLDB
>>>>>>> conference
>>>>>>> describing the Dataflow model (
>>>>>>> http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and were quick to
>>>>>>> release
>>>>>>> the Dataflow SDK as open source software with the launch of Cloud
>>>>>>> Dataflow.
>>>>>>> Our submission to the Apache Software Foundation is a logical
>>>>>>> extension
>>>>>>> of
>>>>>>> our commitment to open source software.
>>>>>>>
>>>>>>> === Homogeneous Developers ===
>>>>>>>
>>>>>>> The majority of committers in this proposal belong to Google due to
>>>>>>> the
>>>>>>> fact that Dataflow has emerged from several internal Google projects.
>>>>>>> This
>>>>>>> proposal also includes committers outside of Google who are actively
>>>>>>> involved with other Apache projects, such as Hadoop, Flink, and
>>>>>>> Spark.
>>>>>>> We
>>>>>>> expect our entry into incubation will allow us to expand the number
>>>>>>> of
>>>>>>> individuals and organizations participating in Dataflow development.
>>>>>>> Additionally, separation of the Dataflow SDK from Google Cloud
>>>>>>> Dataflow
>>>>>>> allows us to focus on the open source SDK and model and do what is
>>>>>>> best
>>>>>>> for
>>>>>>> this project.
>>>>>>>
>>>>>>> === Reliance on Salaried Developers ===
>>>>>>>
>>>>>>> The Dataflow SDK and Dataflow runners have been developed primarily
>>>>>>> by
>>>>>>> salaried developers supporting the Google Cloud Dataflow project.
>>>>>>> While
>>>>>>> the
>>>>>>> Dataflow SDK and Cloud Dataflow have been developed by different
>>>>>>> teams
>>>>>>> (and
>>>>>>> this proposal would reinforce that separation) we expect our initial
>>>>>>> set
>>>>>>> of
>>>>>>> developers will still primarily be salaried. Contribution has not
>>>>>>> been
>>>>>>> exclusively from salaried developers, however. For example, the
>>>>>>> contrib
>>>>>>> directory of the Dataflow SDK (
>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/master/contrib
>>>>>>> )
>>>>>>> contains items from free-time contributors. Moreover, separate
>>>>>>> projects,
>>>>>>> such as ScalaFlow (https://github.com/darkjh/scalaflow) have been
>>>>>>> created
>>>>>>> around the Dataflow model and SDK. We expect our reliance on salaried
>>>>>>> developers will decrease over time during incubation.
>>>>>>>
>>>>>>> === Relationship with other Apache products ===
>>>>>>>
>>>>>>> Dataflow directly interoperates with or utilizes several existing
>>>>>>> Apache
>>>>>>> projects.
>>>>>>>
>>>>>>> * Build
>>>>>>>
>>>>>>> ** Apache Maven
>>>>>>>
>>>>>>> * Data I/O, Libraries
>>>>>>>
>>>>>>> ** Apache Avro
>>>>>>>
>>>>>>> ** Apache Commons
>>>>>>>
>>>>>>> * Dataflow runners
>>>>>>>
>>>>>>> ** Apache Flink
>>>>>>>
>>>>>>> ** Apache Spark
>>>>>>>
>>>>>>> Dataflow when used in batch mode shares similarities with Apache
>>>>>>> Crunch;
>>>>>>> however, Dataflow is focused on a model, SDK, and abstraction layer
>>>>>>> beyond
>>>>>>> Spark and Hadoop (MapReduce). One key goal of Dataflow is to provide
>>>>>>> an
>>>>>>> intermediate abstraction layer which can easily be implemented and
>>>>>>> utilized
>>>>>>> across several different processing frameworks.
>>>>>>>
>>>>>>> === An excessive fascination with the Apache brand ===
>>>>>>>
>>>>>>> With this proposal we are not seeking attention or publicity. Rather,
>>>>>>> we
>>>>>>> firmly believe in the Dataflow model, SDK, and the ability to make
>>>>>>> Dataflow
>>>>>>> a powerful yet simple framework for data processing. While the
>>>>>>> Dataflow
>>>>>>> SDK
>>>>>>> and model have been open source, we believe putting code on GitHub
>>>>>>> can
>>>>>>> only
>>>>>>> go so far. We see the Apache community, processes, and mission as
>>>>>>> critical
>>>>>>> for ensuring the Dataflow SDK and model are truly community-driven,
>>>>>>> positively impactful, and innovative open source software. While
>>>>>>> Google
>>>>>>> has
>>>>>>> taken a number of steps to advance its various open source projects,
>>>>>>> we
>>>>>>> believe Dataflow is a great fit for the Apache Software Foundation
>>>>>>> due
>>>>>>> to
>>>>>>> its focus on data processing and its relationships to existing ASF
>>>>>>> projects.
>>>>>>>
>>>>>>> == Documentation ==
>>>>>>>
>>>>>>> The following documentation is relevant to this proposal. Relevant
>>>>>>> portions
>>>>>>> of the documentation will be contributed to the Apache Dataflow
>>>>>>> project.
>>>>>>>
>>>>>>> * Dataflow website: https://cloud.google.com/dataflow
>>>>>>>
>>>>>>> * Dataflow programming model:
>>>>>>> https://cloud.google.com/dataflow/model/programming-model
>>>>>>>
>>>>>>> * Codebases
>>>>>>>
>>>>>>> ** Dataflow Java SDK:
>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK
>>>>>>>
>>>>>>> ** Flink Dataflow runner:
>>>>>>> https://github.com/dataArtisans/flink-dataflow
>>>>>>>
>>>>>>> ** Spark Dataflow runner: https://github.com/cloudera/spark-dataflow
>>>>>>>
>>>>>>> * Dataflow Java SDK issue tracker:
>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues
>>>>>>>
>>>>>>> * google-cloud-dataflow tag on Stack Overflow:
>>>>>>> http://stackoverflow.com/questions/tagged/google-cloud-dataflow
>>>>>>>
>>>>>>> == Initial Source ==
>>>>>>>
>>>>>>> The initial source for Dataflow which we will submit to the Apache
>>>>>>> Software Foundation will include several related projects which are
>>>>>>> currently hosted in the following GitHub repositories:
>>>>>>>
>>>>>>> * Dataflow Java SDK (
>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK)
>>>>>>>
>>>>>>> * Flink Dataflow runner (
>>>>>>> https://github.com/dataArtisans/flink-dataflow)
>>>>>>>
>>>>>>> * Spark Dataflow runner (https://github.com/cloudera/spark-dataflow)
>>>>>>>
>>>>>>> These projects have always been Apache 2.0 licensed. We intend to
>>>>>>> bundle
>>>>>>> all of these repositories since they are all complementary and should
>>>>>>> be
>>>>>>> maintained in one project. Prior to our submission, we will combine
>>>>>>> all
>>>>>>> of
>>>>>>> these projects into a new git repository.
>>>>>>>
>>>>>>> == Source and Intellectual Property Submission Plan ==
>>>>>>>
>>>>>>> The source for the Dataflow SDK and the three runners (Spark, Flink,
>>>>>>> Google
>>>>>>> Cloud Dataflow) is already licensed under an Apache 2.0 license.
>>>>>>>
>>>>>>> * Dataflow SDK -
>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/LICENSE
>>>>>>>
>>>>>>> * Flink runner -
>>>>>>> https://github.com/dataArtisans/flink-dataflow/blob/master/LICENSE
>>>>>>>
>>>>>>> * Spark runner -
>>>>>>> https://github.com/cloudera/spark-dataflow/blob/master/LICENSE
>>>>>>>
>>>>>>> Contributors to the Dataflow SDK have also signed the Google
>>>>>>> Individual
>>>>>>> Contributor License Agreement (
>>>>>>> https://cla.developers.google.com/about/google-individual) in order
>>>>>>> to
>>>>>>> contribute to the project.
>>>>>>>
>>>>>>> With respect to trademark rights, Google does not hold a trademark on
>>>>>>> the
>>>>>>> phrase “Dataflow.” Based on feedback and guidance we receive during
>>>>>>> the
>>>>>>> incubation process, we are open to renaming the project if necessary
>>>>>>> for
>>>>>>> trademark or other concerns.
>>>>>>>
>>>>>>> == External Dependencies ==
>>>>>>>
>>>>>>> All external dependencies are licensed under an Apache 2.0 or
>>>>>>> Apache-compatible license. As we grow the Dataflow community we will
>>>>>>> configure our build process to validate that all contributions and
>>>>>>> dependencies are licensed under the Apache 2.0 license or an
>>>>>>> Apache-compatible license.
>>>>>>>
>>>>>>> == Required Resources ==
>>>>>>>
>>>>>>> === Mailing Lists ===
>>>>>>>
>>>>>>> We currently use a mix of mailing lists. We will migrate our existing
>>>>>>> mailing lists to the following:
>>>>>>>
>>>>>>> * d...@dataflow.incubator.apache.org
>>>>>>>
>>>>>>> * u...@dataflow.incubator.apache.org
>>>>>>>
>>>>>>> * priv...@dataflow.incubator.apache.org
>>>>>>>
>>>>>>> * comm...@dataflow.incubator.apache.org
>>>>>>>
>>>>>>> === Source Control ===
>>>>>>>
>>>>>>> The Dataflow team currently uses Git and would like to continue to do
>>>>>>> so.
>>>>>>> We request a Git repository for Dataflow with mirroring to GitHub
>>>>>>> enabled.
>>>>>>>
>>>>>>> === Issue Tracking ===
>>>>>>>
>>>>>>> We request the creation of an Apache-hosted JIRA. The Dataflow
>>>>>>> project
>>>>>>> is
>>>>>>> currently using both a public GitHub issue tracker and internal
>>>>>>> Google
>>>>>>> issue tracking. We will migrate and combine from these two sources to
>>>>>>> the
>>>>>>> Apache JIRA.
>>>>>>>
>>>>>>> == Initial Committers ==
>>>>>>>
>>>>>>> * Aljoscha Krettek     [aljos...@apache.org]
>>>>>>>
>>>>>>> * Amit Sela            [amitsel...@gmail.com]
>>>>>>>
>>>>>>> * Ben Chambers         [bchamb...@google.com]
>>>>>>>
>>>>>>> * Craig Chambers       [chamb...@google.com]
>>>>>>>
>>>>>>> * Dan Halperin         [dhalp...@google.com]
>>>>>>>
>>>>>>> * Davor Bonaci         [da...@google.com]
>>>>>>>
>>>>>>> * Frances Perry        [f...@google.com]
>>>>>>>
>>>>>>> * James Malone         [jamesmal...@google.com]
>>>>>>>
>>>>>>> * Jean-Baptiste Onofré [jbono...@apache.org]
>>>>>>>
>>>>>>> * Josh Wills           [jwi...@apache.org]
>>>>>>>
>>>>>>> * Kostas Tzoumas       [kos...@data-artisans.com]
>>>>>>>
>>>>>>> * Kenneth Knowles      [k...@google.com]
>>>>>>>
>>>>>>> * Luke Cwik            [lc...@google.com]
>>>>>>>
>>>>>>> * Maximilian Michels   [m...@apache.org]
>>>>>>>
>>>>>>> * Stephan Ewen         [step...@data-artisans.com]
>>>>>>>
>>>>>>> * Tom White            [t...@cloudera.com]
>>>>>>>
>>>>>>> * Tyler Akidau         [taki...@google.com]
>>>>>>>
>>>>>>> == Affiliations ==
>>>>>>>
>>>>>>> The initial committers are from six organizations. Google developed
>>>>>>> Dataflow and the Dataflow SDK, data Artisans developed the Flink
>>>>>>> runner,
>>>>>>> and Cloudera (Labs) developed the Spark runner.
>>>>>>>
>>>>>>> * Cloudera
>>>>>>>
>>>>>>> ** Tom White
>>>>>>>
>>>>>>> * data Artisans
>>>>>>>
>>>>>>> ** Aljoscha Krettek
>>>>>>>
>>>>>>> ** Kostas Tzoumas
>>>>>>>
>>>>>>> ** Maximilian Michels
>>>>>>>
>>>>>>> ** Stephan Ewen
>>>>>>>
>>>>>>> * Google
>>>>>>>
>>>>>>> ** Ben Chambers
>>>>>>>
>>>>>>> ** Dan Halperin
>>>>>>>
>>>>>>> ** Davor Bonaci
>>>>>>>
>>>>>>> ** Frances Perry
>>>>>>>
>>>>>>> ** James Malone
>>>>>>>
>>>>>>> ** Kenneth Knowles
>>>>>>>
>>>>>>> ** Luke Cwik
>>>>>>>
>>>>>>> ** Tyler Akidau
>>>>>>>
>>>>>>> * PayPal
>>>>>>>
>>>>>>> ** Amit Sela
>>>>>>>
>>>>>>> * Slack
>>>>>>>
>>>>>>> ** Josh Wills
>>>>>>>
>>>>>>> * Talend
>>>>>>>
>>>>>>> ** Jean-Baptiste Onofré
>>>>>>>
>>>>>>> == Sponsors ==
>>>>>>>
>>>>>>> === Champion ===
>>>>>>>
>>>>>>> * Jean-Baptiste Onofré      [jbono...@apache.org]
>>>>>>>
>>>>>>> === Nominated Mentors ===
>>>>>>>
>>>>>>> * Jim Jagielski           [j...@apache.org]
>>>>>>>
>>>>>>> * Venkatesh Seetharam     [venkat...@apache.org]
>>>>>>>
>>>>>>> * Bertrand Delacretaz     [bdelacre...@apache.org]
>>>>>>>
>>>>>>> * Ted Dunning             [tdunn...@apache.org]
>>>>>>>
>>>>>>> === Sponsoring Entity ===
>>>>>>>
>>>>>>> The Apache Incubator
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>> Jean-Baptiste Onofré
>>>>> jbono...@apache.org
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>>>>> For additional commands, e-mail: general-h...@incubator.apache.org
>>>>>
>>>>>
>>>>>
>>>>>
>>
