We are developing parallel machine learning algorithms for a research project and are very interested in Dataflow. I would like to contribute to this project as well. It would be great if you could add me.
Thanks,
Supun...

On Thu, Jan 21, 2016 at 6:29 PM, Mayank Bansal <maban...@gmail.com> wrote:

> Hi Jean,
>
> Nice proposal.
>
> I wanted to contribute to this project. Can you please add me too?
>
> Thanks a lot for the help.
>
> Thanks,
> Mayank
>
> On Thu, Jan 21, 2016 at 8:07 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
> > Hey Alex,
> >
> > awesome: I added you on the proposal.
> >
> > Thanks,
> > Regards
> > JB
> >
> > On 01/21/2016 05:03 PM, Alexander Bezzubov wrote:
> >
> >> Hi,
> >>
> >> it's great to see Dataflow becoming part of the Apache ecosystem; thank you for bringing it in.
> >> I would be happy to get involved and help.
> >>
> >> --
> >> Alex
> >>
> >> On Thu, Jan 21, 2016 at 8:42 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >>
> >>> Perfect: done, you are on the proposal.
> >>>
> >>> Thanks!
> >>> Regards
> >>> JB
> >>>
> >>> On 01/21/2016 11:55 AM, chatz wrote:
> >>>
> >>>> Charitha Elvitigala
> >>>>
> >>>> On 21 January 2016 at 16:17, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >>>>
> >>>>> Hi Chatz,
> >>>>>
> >>>>> sure, what name should I use on the proposal, Charitha?
> >>>>>
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>> On 01/21/2016 11:32 AM, chatz wrote:
> >>>>>
> >>>>>> Hi Jean,
> >>>>>>
> >>>>>> I’d be interested in contributing as well.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Chatz
> >>>>>>
> >>>>>> On 21 January 2016 at 14:22, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >>>>>>
> >>>>>>> Sweet: you are on the proposal ;)
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>> Regards
> >>>>>>> JB
> >>>>>>>
> >>>>>>> On 01/21/2016 08:55 AM, Byung-Gon Chun wrote:
> >>>>>>>
> >>>>>>>> This looks very interesting. I'm interested in contributing.
> >>>>>>>>
> >>>>>>>> Thanks.
> >>>>>>>> -Gon
> >>>>>>>>
> >>>>>>>> ---
> >>>>>>>> Byung-Gon Chun
> >>>>>>>>
> >>>>>>>> On Thu, Jan 21, 2016 at 1:32 AM, James Malone <jamesmal...@google.com.invalid> wrote:
> >>>>>>>>
> >>>>>>>>> Hello everyone,
> >>>>>>>>>
> >>>>>>>>> Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration.
> >>>>>>>>>
> >>>>>>>>> The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes:
> >>>>>>>>>
> >>>>>>>>> https://wiki.apache.org/incubator/DataflowProposal
> >>>>>>>>>
> >>>>>>>>> We look forward to your feedback and input.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>>
> >>>>>>>>> James
> >>>>>>>>>
> >>>>>>>>> ----
> >>>>>>>>>
> >>>>>>>>> = Apache Dataflow =
> >>>>>>>>>
> >>>>>>>>> == Abstract ==
> >>>>>>>>>
> >>>>>>>>> Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSLs in different languages, allowing users to easily implement their data integration processes.
> >>>>>>>>>
> >>>>>>>>> == Proposal ==
> >>>>>>>>>
> >>>>>>>>> Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale.
> >>>>>>>>> Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control.
> >>>>>>>>>
> >>>>>>>>> == Background ==
> >>>>>>>>>
> >>>>>>>>> Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and MillWheel inside Google and is focused on providing a unified solution for batch and stream processing. These projects on which Dataflow is based have been published in several papers made available to the public:
> >>>>>>>>>
> >>>>>>>>> * MapReduce - http://research.google.com/archive/mapreduce.html
> >>>>>>>>> * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
> >>>>>>>>> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
> >>>>>>>>> * MillWheel - http://research.google.com/pubs/pub41378.html
> >>>>>>>>>
> >>>>>>>>> Dataflow was designed from the start to provide a portable programming layer.
> >>>>>>>>> When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines. Several engines have been developed to run Dataflow pipelines in other open source runtimes, including Dataflow runners for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service. A separate Python SDK is currently in active development.
> >>>>>>>>>
> >>>>>>>>> In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which are a part of this proposal include those for Spark (from Cloudera), Flink (from data Artisans), and local development (from Google); the Google Cloud Dataflow service runner is not included in this proposal. Further references to Dataflow will refer to the Dataflow model, SDKs, and runners which are a part of this proposal (Apache Dataflow) only. The initial submission will contain the already-released Java SDK; Google intends to submit the Python SDK later in the incubation process.
> >>>>>>>>> The Google Cloud Dataflow service will continue to be one of many runners for Dataflow, built on Google Cloud Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will develop against the Apache project additions, updates, and changes. Google Cloud Dataflow will become one user of Apache Dataflow and will participate in the project openly and publicly.
> >>>>>>>>>
> >>>>>>>>> The Dataflow programming model has been designed with simplicity, scalability, and speed as key tenets. In the Dataflow model, you only need to think about four top-level concepts when constructing your data processing job:
> >>>>>>>>>
> >>>>>>>>> * Pipelines - The data processing job made of a series of computations including input, processing, and output
> >>>>>>>>> * PCollections - Bounded (or unbounded) datasets which represent the input, intermediate and output data in pipelines
> >>>>>>>>> * PTransforms - A data processing step in a pipeline in which one or more PCollections are an input and output
> >>>>>>>>> * I/O Sources and Sinks - APIs for reading and writing data which are the roots and endpoints of the pipeline
> >>>>>>>>>
> >>>>>>>>> == Rationale ==
> >>>>>>>>>
> >>>>>>>>> With Dataflow, Google intended to develop a framework which allowed developers to be maximally productive in defining the processing, and then be able to execute the program at various levels of latency/cost/completeness without re-architecting or re-writing it.
> >>>>>>>>> This goal was informed by Google’s past experience developing several models, frameworks, and tools useful for large-scale and distributed data processing. While Google has previously published papers describing some of its technologies, Google decided to take a different approach with Dataflow. Google open-sourced the SDK and model alongside commercialization of the idea and ahead of publishing papers on the topic. As a result, a number of open source runtimes exist for Dataflow, such as the Apache Flink and Apache Spark runners.
> >>>>>>>>>
> >>>>>>>>> We believe that submitting Dataflow as an Apache project will provide an immediate, worthwhile, and substantial contribution to the open source community. As an incubating project, we believe Dataflow will have a better opportunity to provide a meaningful contribution to OSS and also integrate with other Apache projects.
> >>>>>>>>>
> >>>>>>>>> In the long term, we believe Dataflow can be a powerful abstraction layer for data processing. By providing an abstraction layer for data pipelines and processing, data workflows can be increasingly portable, resilient to breaking changes in tooling, and compatible across many execution engines, runtimes, and open source projects.
> >>>>>>>>>
> >>>>>>>>> == Initial Goals ==
> >>>>>>>>>
> >>>>>>>>> We are breaking our initial goals into immediate (< 2 months), short-term (2-4 months), and intermediate-term (> 4 months).
> >>>>>>>>>
> >>>>>>>>> Our immediate goals include the following:
> >>>>>>>>>
> >>>>>>>>> * Plan for reconciling the Dataflow Java SDK and various runners into one project
> >>>>>>>>> * Plan for refactoring the existing Java SDK for better extensibility by SDK and runner writers
> >>>>>>>>> * Validating that all dependencies are ASL 2.0 or compatible
> >>>>>>>>> * Understanding and adapting to the Apache development process
> >>>>>>>>>
> >>>>>>>>> Our short-term goals include:
> >>>>>>>>>
> >>>>>>>>> * Moving the newly-merged lists and build utilities to Apache
> >>>>>>>>> * Start refactoring codebase and move code to Apache Git repo
> >>>>>>>>> * Continue development of new features, functions, and fixes in the Dataflow Java SDK and Dataflow runners
> >>>>>>>>> * Cleaning up the Dataflow SDK sources and crafting a roadmap and plan for how to include new major ideas, modules, and runtimes
> >>>>>>>>> * Establishment of an easy and clear build/test framework for Dataflow and associated runtimes; creation of testing, rollback, and validation policy
> >>>>>>>>> * Analysis and design for work needed to make Dataflow a better data processing abstraction layer for multiple open source frameworks and environments
> >>>>>>>>>
> >>>>>>>>> Finally, we have a number of intermediate-term goals:
> >>>>>>>>>
> >>>>>>>>> * Roadmapping, planning, and execution of integrations with other OSS and non-OSS projects/products
> >>>>>>>>> * Inclusion of an additional SDK for Python, which is under active development
> >>>>>>>>>
> >>>>>>>>> == Current Status ==
> >>>>>>>>>
> >>>>>>>>> === Meritocracy ===
> >>>>>>>>>
> >>>>>>>>> Dataflow
> >>>>>>>>> was initially developed based on ideas from many employees within Google. As an ASL OSS project on GitHub, the Dataflow SDK has received contributions from data Artisans, Cloudera Labs, and other individual developers. As a project under incubation, we are committed to expanding our effort to build an environment which supports a meritocracy. We are focused on engaging the community and other related projects for support and contributions. Moreover, we are committed to ensuring contributors and committers to Dataflow come from a broad mix of organizations through a merit-based decision process during incubation. We believe strongly in the Dataflow model and are committed to growing an inclusive community of Dataflow contributors.
> >>>>>>>>>
> >>>>>>>>> === Community ===
> >>>>>>>>>
> >>>>>>>>> The core of the Dataflow Java SDK has been developed by Google for use with Google Cloud Dataflow. Google has active community engagement in the SDK GitHub repository (https://github.com/GoogleCloudPlatform/DataflowJavaSDK), on Stack Overflow (http://stackoverflow.com/questions/tagged/google-cloud-dataflow) and has had contributions from a number of organizations and individuals.
> >>>>>>>>>
> >>>>>>>>> Every day, Cloud Dataflow is actively used by a number of organizations and institutions for batch and stream processing of data.
> >>>>>>>>> We believe acceptance will allow us to consolidate existing Dataflow-related work, grow the Dataflow community, and deepen connections between Dataflow and other open source projects.
> >>>>>>>>>
> >>>>>>>>> === Core Developers ===
> >>>>>>>>>
> >>>>>>>>> The core developers for Dataflow and the Dataflow runners are:
> >>>>>>>>>
> >>>>>>>>> * Frances Perry
> >>>>>>>>> * Tyler Akidau
> >>>>>>>>> * Davor Bonaci
> >>>>>>>>> * Luke Cwik
> >>>>>>>>> * Ben Chambers
> >>>>>>>>> * Kenn Knowles
> >>>>>>>>> * Dan Halperin
> >>>>>>>>> * Daniel Mills
> >>>>>>>>> * Mark Shields
> >>>>>>>>> * Craig Chambers
> >>>>>>>>> * Maximilian Michels
> >>>>>>>>> * Tom White
> >>>>>>>>> * Josh Wills
> >>>>>>>>>
> >>>>>>>>> === Alignment ===
> >>>>>>>>>
> >>>>>>>>> The Dataflow SDK can be used to create Dataflow pipelines which can be executed on Apache Spark or Apache Flink. Dataflow is also related to other Apache projects, such as Apache Crunch. We plan on expanding functionality for Dataflow runners, support for additional domain specific languages, and increased portability so Dataflow is a powerful abstraction layer for data processing.
> >>>>>>>>>
> >>>>>>>>> == Known Risks ==
> >>>>>>>>>
> >>>>>>>>> === Orphaned Products ===
> >>>>>>>>>
> >>>>>>>>> The Dataflow SDK is presently used by several organizations, from small startups to Fortune 100 companies, to construct production pipelines which are executed in Google Cloud Dataflow.
> >>>>>>>>> Google has a long-term commitment to advance the Dataflow SDK; moreover, Dataflow is seeing increasing interest, development, and adoption from organizations outside of Google.
> >>>>>>>>>
> >>>>>>>>> === Inexperience with Open Source ===
> >>>>>>>>>
> >>>>>>>>> Google believes strongly in open source and the exchange of information to advance new ideas and work. Examples of this commitment are active OSS projects such as Chromium (https://www.chromium.org) and Kubernetes (http://kubernetes.io/). With Dataflow, we have tried to be increasingly open and forward-looking; we have published a paper in the VLDB conference describing the Dataflow model (http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and were quick to release the Dataflow SDK as open source software with the launch of Cloud Dataflow. Our submission to the Apache Software Foundation is a logical extension of our commitment to open source software.
> >>>>>>>>>
> >>>>>>>>> === Homogeneous Developers ===
> >>>>>>>>>
> >>>>>>>>> The majority of committers in this proposal belong to Google due to the fact that Dataflow has emerged from several internal Google projects. This proposal also includes committers outside of Google who are actively involved with other Apache projects, such as Hadoop, Flink, and Spark. We expect our entry into incubation will allow us to expand the number of individuals and organizations participating in Dataflow development.
> >>>>>>>>> Additionally, separation of the Dataflow SDK from Google Cloud Dataflow allows us to focus on the open source SDK and model and do what is best for this project.
> >>>>>>>>>
> >>>>>>>>> === Reliance on Salaried Developers ===
> >>>>>>>>>
> >>>>>>>>> The Dataflow SDK and Dataflow runners have been developed primarily by salaried developers supporting the Google Cloud Dataflow project. While the Dataflow SDK and Cloud Dataflow have been developed by different teams (and this proposal would reinforce that separation), we expect our initial set of developers will still primarily be salaried. Contribution has not been exclusively from salaried developers, however. For example, the contrib directory of the Dataflow SDK (https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/master/contrib) contains items from free-time contributors. Moreover, separate projects, such as ScalaFlow (https://github.com/darkjh/scalaflow), have been created around the Dataflow model and SDK. We expect our reliance on salaried developers will decrease over time during incubation.
> >>>>>>>>>
> >>>>>>>>> === Relationship with other Apache products ===
> >>>>>>>>>
> >>>>>>>>> Dataflow directly interoperates with or utilizes several existing Apache projects.
> >>>>>>>>>
> >>>>>>>>> * Build
> >>>>>>>>> ** Apache Maven
> >>>>>>>>> * Data I/O, Libraries
> >>>>>>>>> ** Apache Avro
> >>>>>>>>> ** Apache Commons
> >>>>>>>>> * Dataflow runners
> >>>>>>>>> ** Apache Flink
> >>>>>>>>> ** Apache Spark
> >>>>>>>>>
> >>>>>>>>> Dataflow when used in batch mode shares similarities with Apache Crunch; however, Dataflow is focused on a model, SDK, and abstraction layer beyond Spark and Hadoop (MapReduce). One key goal of Dataflow is to provide an intermediate abstraction layer which can easily be implemented and utilized across several different processing frameworks.
> >>>>>>>>>
> >>>>>>>>> === An excessive fascination with the Apache brand ===
> >>>>>>>>>
> >>>>>>>>> With this proposal we are not seeking attention or publicity. Rather, we firmly believe in the Dataflow model, SDK, and the ability to make Dataflow a powerful yet simple framework for data processing. While the Dataflow SDK and model have been open source, we believe putting code on GitHub can only go so far. We see the Apache community, processes, and mission as critical for ensuring the Dataflow SDK and model are truly community-driven, positively impactful, and innovative open source software. While Google has taken a number of steps to advance its various open source projects, we believe Dataflow is a great fit for the Apache Software Foundation due to its focus on data processing and its relationships to existing ASF projects.
> >>>>>>>>>
> >>>>>>>>> == Documentation ==
> >>>>>>>>>
> >>>>>>>>> The following documentation is relevant to this proposal. Relevant portions of the documentation will be contributed to the Apache Dataflow project.
> >>>>>>>>>
> >>>>>>>>> * Dataflow website: https://cloud.google.com/dataflow
> >>>>>>>>> * Dataflow programming model: https://cloud.google.com/dataflow/model/programming-model
> >>>>>>>>> * Codebases
> >>>>>>>>> ** Dataflow Java SDK: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
> >>>>>>>>> ** Flink Dataflow runner: https://github.com/dataArtisans/flink-dataflow
> >>>>>>>>> ** Spark Dataflow runner: https://github.com/cloudera/spark-dataflow
> >>>>>>>>> * Dataflow Java SDK issue tracker: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues
> >>>>>>>>> * google-cloud-dataflow tag on Stack Overflow: http://stackoverflow.com/questions/tagged/google-cloud-dataflow
> >>>>>>>>>
> >>>>>>>>> == Initial Source ==
> >>>>>>>>>
> >>>>>>>>> The initial source for Dataflow which we will submit to the Apache Foundation will include several related projects which are currently hosted in the following GitHub repositories:
> >>>>>>>>>
> >>>>>>>>> * Dataflow Java SDK (https://github.com/GoogleCloudPlatform/DataflowJavaSDK)
> >>>>>>>>> * Flink Dataflow runner (https://github.com/dataArtisans/flink-dataflow)
> >>>>>>>>> * Spark Dataflow runner (https://github.com/cloudera/spark-dataflow)
> >>>>>>>>>
> >>>>>>>>> These projects have always been Apache 2.0 licensed. We intend to bundle all of these repositories since they are all complementary and should be maintained in one project.
> >>>>>>>>> Prior to our submission, we will combine all of these projects into a new git repository.
> >>>>>>>>>
> >>>>>>>>> == Source and Intellectual Property Submission Plan ==
> >>>>>>>>>
> >>>>>>>>> The source for the Dataflow SDK and the three runners (Spark, Flink, Google Cloud Dataflow) is already licensed under an Apache 2 license.
> >>>>>>>>>
> >>>>>>>>> * Dataflow SDK - https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/LICENSE
> >>>>>>>>> * Flink runner - https://github.com/dataArtisans/flink-dataflow/blob/master/LICENSE
> >>>>>>>>> * Spark runner - https://github.com/cloudera/spark-dataflow/blob/master/LICENSE
> >>>>>>>>>
> >>>>>>>>> Contributors to the Dataflow SDK have also signed the Google Individual Contributor License Agreement (https://cla.developers.google.com/about/google-individual) in order to contribute to the project.
> >>>>>>>>>
> >>>>>>>>> With respect to trademark rights, Google does not hold a trademark on the phrase “Dataflow.” Based on feedback and guidance we receive during the incubation process, we are open to renaming the project if necessary for trademark or other concerns.
> >>>>>>>>>
> >>>>>>>>> == External Dependencies ==
> >>>>>>>>>
> >>>>>>>>> All external dependencies are licensed under an Apache 2.0 or Apache-compatible license. As we grow the Dataflow community we will configure our build process to require and validate that all contributions and dependencies are licensed under the Apache 2.0 license or are under an Apache-compatible license.
> >>>>>>>>>
> >>>>>>>>> == Required Resources ==
> >>>>>>>>>
> >>>>>>>>> === Mailing Lists ===
> >>>>>>>>>
> >>>>>>>>> We currently use a mix of mailing lists. We will migrate our existing mailing lists to the following:
> >>>>>>>>>
> >>>>>>>>> * d...@dataflow.incubator.apache.org
> >>>>>>>>> * u...@dataflow.incubator.apache.org
> >>>>>>>>> * priv...@dataflow.incubator.apache.org
> >>>>>>>>> * comm...@dataflow.incubator.apache.org
> >>>>>>>>>
> >>>>>>>>> === Source Control ===
> >>>>>>>>>
> >>>>>>>>> The Dataflow team currently uses Git and would like to continue to do so. We request a Git repository for Dataflow with mirroring to GitHub enabled.
> >>>>>>>>>
> >>>>>>>>> === Issue Tracking ===
> >>>>>>>>>
> >>>>>>>>> We request the creation of an Apache-hosted JIRA. The Dataflow project is currently using both a public GitHub issue tracker and internal Google issue tracking. We will migrate and combine issues from these two sources into the Apache JIRA.
> >>>>>>>>>
> >>>>>>>>> == Initial Committers ==
> >>>>>>>>>
> >>>>>>>>> * Aljoscha Krettek [aljos...@apache.org]
> >>>>>>>>> * Amit Sela [amitsel...@gmail.com]
> >>>>>>>>> * Ben Chambers [bchamb...@google.com]
> >>>>>>>>> * Craig Chambers [chamb...@google.com]
> >>>>>>>>> * Dan Halperin [dhalp...@google.com]
> >>>>>>>>> * Davor Bonaci [da...@google.com]
> >>>>>>>>> * Frances Perry [f...@google.com]
> >>>>>>>>> * James Malone [jamesmal...@google.com]
> >>>>>>>>> * Jean-Baptiste Onofré [jbono...@apache.org]
> >>>>>>>>> * Josh Wills [jwi...@apache.org]
> >>>>>>>>> * Kostas Tzoumas [kos...@data-artisans.com]
> >>>>>>>>> * Kenneth Knowles [k...@google.com]
> >>>>>>>>> * Luke Cwik [lc...@google.com]
> >>>>>>>>> * Maximilian Michels [m...@apache.org]
> >>>>>>>>> * Stephan Ewen [step...@data-artisans.com]
> >>>>>>>>> * Tom White [t...@cloudera.com]
> >>>>>>>>> * Tyler Akidau [taki...@google.com]
> >>>>>>>>>
> >>>>>>>>> == Affiliations ==
> >>>>>>>>>
> >>>>>>>>> The initial committers are from six organizations. Google developed Dataflow and the Dataflow SDK, data Artisans developed the Flink runner, and Cloudera (Labs) developed the Spark runner.
> >>>>>>>>>
> >>>>>>>>> * Cloudera
> >>>>>>>>> ** Tom White
> >>>>>>>>> * data Artisans
> >>>>>>>>> ** Aljoscha Krettek
> >>>>>>>>> ** Kostas Tzoumas
> >>>>>>>>> ** Maximilian Michels
> >>>>>>>>> ** Stephan Ewen
> >>>>>>>>> * Google
> >>>>>>>>> ** Ben Chambers
> >>>>>>>>> ** Dan Halperin
> >>>>>>>>> ** Davor Bonaci
> >>>>>>>>> ** Frances Perry
> >>>>>>>>> ** James Malone
> >>>>>>>>> ** Kenneth Knowles
> >>>>>>>>> ** Luke Cwik
> >>>>>>>>> ** Tyler Akidau
> >>>>>>>>> * PayPal
> >>>>>>>>> ** Amit Sela
> >>>>>>>>> * Slack
> >>>>>>>>> ** Josh Wills
> >>>>>>>>> * Talend
> >>>>>>>>> ** Jean-Baptiste Onofré
> >>>>>>>>>
> >>>>>>>>> == Sponsors ==
> >>>>>>>>>
> >>>>>>>>> === Champion ===
> >>>>>>>>>
> >>>>>>>>> * Jean-Baptiste Onofré [jbono...@apache.org]
> >>>>>>>>>
> >>>>>>>>> === Nominated Mentors ===
> >>>>>>>>>
> >>>>>>>>> * Jim Jagielski [j...@apache.org]
> >>>>>>>>> * Venkatesh Seetharam [venkat...@apache.org]
> >>>>>>>>> * Bertrand Delacretaz [bdelacre...@apache.org]
> >>>>>>>>> * Ted Dunning [tdunn...@apache.org]
> >>>>>>>>>
> >>>>>>>>> === Sponsoring Entity ===
> >>>>>>>>>
> >>>>>>>>> The Apache Incubator
> >>>>>>>
> >>>>>>> --
> >>>>>>> Jean-Baptiste Onofré
> >>>>>>> jbono...@apache.org
> >>>>>>> http://blog.nanthrax.net
> >>>>>>> Talend - http://www.talend.com
> >>>>>>>
> >>>>>>> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> >>>>>>> For additional commands, e-mail: general-h...@incubator.apache.org
>
> --
> Thanks and Regards,
> Mayank
> Cell: 408-718-9370

--
Supun Kamburugamuva
Member, Apache Software Foundation; http://www.apache.org
E-mail: supu...@gmail.com; Mobile: +1 812 369 6762
Blog: http://supunk.blogspot.com