Hi Jean,
Nice Proposal.
I wanted to contribute to this project. Can you please add me too?
Thanks a lot for the help
Thanks,
Mayank
On Thu, Jan 21, 2016 at 8:07 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
Hey Alex,
awesome: I added you on the proposal.
Thanks,
Regards
JB
On 01/21/2016 05:03 PM, Alexander Bezzubov wrote:
Hi,
it's great to see DataFlow becoming part of the Apache ecosystem, thank you for bringing it in.
I would be happy to get involved and help.
--
Alex
On Thu, Jan 21, 2016 at 8:42 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
Perfect: done, you are on the proposal.
Thanks !
Regards
JB
On 01/21/2016 11:55 AM, chatz wrote:
Charitha Elvitigala
On 21 January 2016 at 16:17, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
Hi Chatz,
sure, what name should I use on the proposal, Charitha ?
Regards
JB
On 01/21/2016 11:32 AM, chatz wrote:
Hi Jean,
I’d be interested in contributing as well.
Thanks,
Chatz
On 21 January 2016 at 14:22, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
Sweet: you are on the proposal ;)
Thanks !
Regards
JB
On 01/21/2016 08:55 AM, Byung-Gon Chun wrote:
This looks very interesting. I'm interested in contributing.
Thanks.
-Gon
---
Byung-Gon Chun
On Thu, Jan 21, 2016 at 1:32 AM, James Malone <jamesmal...@google.com.invalid> wrote:
Hello everyone,
Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration.
The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes:
https://wiki.apache.org/incubator/DataflowProposal
We look forward to your feedback and input.
Best,
James
----
= Apache Dataflow =
== Abstract ==
Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSLs in different languages, allowing users to easily implement their data integration processes.
== Proposal ==
Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control.
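To make "data windowing" and "aggregate computation" a little more concrete, here is a minimal sketch in plain Python. It does not use the Dataflow SDK; the function name `fixed_window_sums` and the `(timestamp, key, value)` event layout are assumptions made up for this illustration. It assigns each event to a fixed-size window by timestamp and sums values per key and window.

```python
from collections import defaultdict


def fixed_window_sums(events, window_size):
    """Sum values per (key, window_start) for (timestamp, key, value) events.

    Hypothetical helper, not an SDK function: each event falls into the
    fixed window [window_start, window_start + window_size) containing
    its timestamp.
    """
    sums = defaultdict(int)
    for timestamp, key, value in events:
        # Integer division snaps the timestamp to the start of its window.
        window_start = (timestamp // window_size) * window_size
        sums[(key, window_start)] += value
    return dict(sums)


# Illustrative events: (timestamp in seconds, key, value).
events = [
    (1, "a", 10),
    (5, "b", 1),
    (61, "a", 5),
    (62, "a", 7),
    (119, "b", 2),
]
result = fixed_window_sums(events, window_size=60)
```

With 60-second windows, the events at seconds 61 and 62 for key "a" land in the same window and are summed together, while the event at second 1 stays in the earlier window.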
== Background ==
Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and MillWheel inside Google and is focused on providing a unified solution for batch and stream processing. These projects on which Dataflow is based have been published in several papers made available to the public:
* MapReduce - http://research.google.com/archive/mapreduce.html
* Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
* FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
* MillWheel - http://research.google.com/pubs/pub41378.html
Dataflow was designed from the start to provide a portable programming layer. When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines. Several engines have been developed to run Dataflow pipelines in other open source runtimes, including Dataflow runners for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service. A Python SDK is currently in active development.
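The portability idea (one pipeline definition, several interchangeable runners) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `run_direct` and `run_threaded` are hypothetical names invented here, not SDK runners, and a "pipeline" is reduced to a list of per-element functions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_direct(steps, elements):
    """Toy in-process runner: apply each step to every element, in order."""
    results = list(elements)
    for step in steps:
        results = [step(x) for x in results]
    return results


def run_threaded(steps, elements, workers=4):
    """Toy parallel runner: same semantics, each step fanned out to threads."""
    results = list(elements)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for step in steps:
            # Executor.map returns results in input order.
            results = list(pool.map(step, results))
    return results


# The pipeline definition is written once...
steps = [str.strip, str.lower]
data = ["  Flink ", "SPARK", " Dataflow"]

# ...and handed unchanged to either runner.
direct_output = run_direct(steps, data)
threaded_output = run_threaded(steps, data)
```

The point of the design is that `steps` knows nothing about how it is executed, so swapping the runner does not require touching the pipeline definition.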
In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which are a part of this proposal include those for Spark (from Cloudera), Flink (from data Artisans), and local development (from Google); the Google Cloud Dataflow service runner is not included in this proposal. Further references to Dataflow will refer to the Dataflow model, SDKs, and runners which are a part of this proposal (Apache Dataflow) only. The initial submission will contain the already-released Java SDK; Google intends to submit the Python SDK later in the incubation process. The Google Cloud Dataflow service will continue to be one of many runners for Dataflow, built on Google Cloud Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will develop against the Apache project additions, updates, and changes. Google Cloud Dataflow will become one user of Apache Dataflow and will participate in the project openly and publicly.
The Dataflow programming model has been designed with simplicity, scalability, and speed as key tenets. In the Dataflow model, you only need to think about four top-level concepts when constructing your data processing job:
* Pipelines - The data processing job made of a series of computations including input, processing, and output
* PCollections - Bounded (or unbounded) datasets which represent the input, intermediate and output data in pipelines
* PTransforms - A data processing step in a pipeline in which one or more PCollections are an input and output
* I/O Sources and Sinks - APIs for reading and writing data which are the roots and endpoints of the pipeline
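As a rough illustration of how the four concepts above fit together, here is a toy, in-memory model in plain Python. It is not the Dataflow SDK API: every class and function name below (`Pipeline`, `PCollection`, `read_from_list`, `map_elements`, `count_per_element`, `write_to_list`) is a hypothetical stand-in, and only bounded data is handled.

```python
class PCollection:
    """Toy stand-in for a bounded dataset flowing through a pipeline."""

    def __init__(self, elements):
        self.elements = list(elements)


class Pipeline:
    """Toy pipeline: a source followed by an ordered series of transforms."""

    def __init__(self, source):
        self.source = source
        self.steps = []

    def apply(self, transform):
        # Only record the step; nothing executes until run() is called.
        self.steps.append(transform)
        return self

    def run(self):
        pcoll = self.source()
        for step in self.steps:
            pcoll = step(pcoll)
        return pcoll


def read_from_list(items):
    """Toy source (pipeline root): produces the input PCollection."""
    return lambda: PCollection(items)


def map_elements(fn):
    """Toy transform: apply fn to every element."""
    return lambda pcoll: PCollection(fn(x) for x in pcoll.elements)


def count_per_element():
    """Toy transform: count occurrences of each distinct element."""
    def transform(pcoll):
        counts = {}
        for x in pcoll.elements:
            counts[x] = counts.get(x, 0) + 1
        return PCollection(sorted(counts.items()))
    return transform


def write_to_list(sink):
    """Toy sink (pipeline endpoint): append the elements to a list."""
    def transform(pcoll):
        sink.extend(pcoll.elements)
        return pcoll
    return transform


output = []
pipeline = (
    Pipeline(read_from_list(["Flink", "spark", "flink", "Spark", "crunch"]))
    .apply(map_elements(str.lower))
    .apply(count_per_element())
    .apply(write_to_list(output))
)
pipeline.run()
```

Here the `Pipeline` object is only a description of the job until `run()` is called, each transform takes one collection and returns a new one, and the source and sink sit at the two ends of the chain.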
== Rationale ==
With Dataflow, Google intended to develop a framework which allowed developers to be maximally productive in defining the processing, and then be able to execute the program at various levels of latency/cost/completeness without re-architecting or re-writing it. This goal was informed by Google’s past experience developing several models, frameworks, and tools useful for large-scale and distributed data processing. While Google has previously published papers describing some of its technologies, Google decided to take a different approach with Dataflow. Google open-sourced the SDK and model alongside commercialization of the idea and ahead of publishing papers on the topic. As a result, a number of open source runtimes exist for Dataflow, such as the Apache Flink and Apache Spark runners.
We believe that submitting Dataflow as an Apache project will provide an immediate, worthwhile, and substantial contribution to the open source community. As an incubating project, we believe Dataflow will have a better opportunity to provide a meaningful contribution to OSS and also integrate with other Apache projects.
In the long term, we believe Dataflow can be a powerful abstraction layer for data processing. By providing an abstraction layer for data pipelines and processing, data workflows can be increasingly portable, resilient to breaking changes in tooling, and compatible across many execution engines, runtimes, and open source projects.
== Initial Goals ==
We are breaking our initial goals into immediate (< 2 months), short-term (2-4 months), and intermediate-term (> 4 months).
Our immediate goals include the following:
* Plan for reconciling the Dataflow Java SDK and various runners into one project
* Plan for refactoring the existing Java SDK for better extensibility by SDK and runner writers
* Validating all dependencies are ASL 2.0 or compatible
* Understanding and adapting to the Apache development process
Our short-term goals include:
* Moving the newly-merged lists and build utilities to Apache
* Start refactoring codebase and move code to Apache Git repo
* Continue development of new features, functions, and fixes in the Dataflow Java SDK and Dataflow runners
* Cleaning up the Dataflow SDK sources and crafting a roadmap and plan for how to include new major ideas, modules, and runtimes
* Establishment of easy and clear build/test framework for Dataflow and associated runtimes; creation of testing, rollback, and validation policy
* Analysis and design for work needed to make Dataflow a better data processing abstraction layer for multiple open source frameworks and environments
Finally, we have a number of intermediate-term goals:
* Roadmapping, planning, and execution of integrations with other OSS and non-OSS projects/products
* Inclusion of additional SDK for Python, which is under active development
== Current Status ==
=== Meritocracy ===
Dataflow was initially developed based on ideas from many employees within Google. As an ASL OSS project on GitHub, the Dataflow SDK has received contributions from data Artisans, Cloudera Labs, and other individual developers. As a project under incubation, we are committed to expanding our effort to build an environment which supports a meritocracy. We are focused on engaging the community and other related projects for support and contributions. Moreover, we are committed to ensuring that contributors and committers to Dataflow come from a broad mix of organizations through a merit-based decision process during incubation. We believe strongly in the Dataflow model and are committed to growing an inclusive community of Dataflow contributors.
=== Community ===
The core of the Dataflow Java SDK has been developed by Google for use with Google Cloud Dataflow. Google has active community engagement in the SDK GitHub repository (https://github.com/GoogleCloudPlatform/DataflowJavaSDK), on Stack Overflow (http://stackoverflow.com/questions/tagged/google-cloud-dataflow) and has had contributions from a number of organizations and individuals.
Every day, Cloud Dataflow is actively used by a number of organizations and institutions for batch and stream processing of data. We believe acceptance will allow us to consolidate existing Dataflow-related work, grow the Dataflow community, and deepen connections between Dataflow and other open source projects.
=== Core Developers ===
The core developers for Dataflow and
the Dataflow runners are:
* Frances Perry
* Tyler Akidau
* Davor Bonaci
* Luke Cwik
* Ben Chambers
* Kenn Knowles
* Dan Halperin
* Daniel Mills
* Mark Shields
* Craig Chambers
* Maximilian Michels
* Tom White
* Josh Wills
=== Alignment ===
The Dataflow SDK can be used to create Dataflow pipelines which can be executed on Apache Spark or Apache Flink. Dataflow is also related to other Apache projects, such as Apache Crunch. We plan on expanding functionality for Dataflow runners, support for additional domain specific languages, and increased portability so Dataflow is a powerful abstraction layer for data processing.
== Known Risks ==
=== Orphaned Products ===
The Dataflow SDK is presently used by several organizations, from small startups to Fortune 100 companies, to construct production pipelines which are executed in Google Cloud Dataflow. Google has a long-term commitment to advance the Dataflow SDK; moreover, Dataflow is seeing increasing interest, development, and adoption from organizations outside of Google.
=== Inexperience with Open Source ===
Google believes strongly in open source and the exchange of information to advance new ideas and work. Examples of this commitment are active OSS projects such as Chromium (https://www.chromium.org) and Kubernetes (http://kubernetes.io/). With Dataflow, we have tried to be increasingly open and forward-looking; we have published a paper in the VLDB conference describing the Dataflow model (http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and were quick to release the Dataflow SDK as open source software with the launch of Cloud Dataflow. Our submission to the Apache Software Foundation is a logical extension of our commitment to open source software.
=== Homogeneous Developers ===
The majority of committers in this proposal belong to Google due to the fact that Dataflow has emerged from several internal Google projects. This proposal also includes committers outside of Google who are actively involved with other Apache projects, such as Hadoop, Flink, and Spark. We expect our entry into incubation will allow us to expand the number of individuals and organizations participating in Dataflow development. Additionally, separation of the Dataflow SDK from Google Cloud Dataflow allows us to focus on the open source SDK and model and do what is best for this project.
=== Reliance on Salaried Developers ===
The Dataflow SDK and Dataflow runners have been developed primarily by salaried developers supporting the Google Cloud Dataflow project. While the Dataflow SDK and Cloud Dataflow have been developed by different teams (and this proposal would reinforce that separation) we expect our initial set of developers will still primarily be salaried. Contribution has not been exclusively from salaried developers, however. For example, the contrib directory of the Dataflow SDK (https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/master/contrib) contains items from free-time contributors. Moreover, separate projects, such as ScalaFlow (https://github.com/darkjh/scalaflow), have been created around the Dataflow model and SDK. We expect our reliance on salaried developers will decrease over time during incubation.
=== Relationship with other Apache products ===
Dataflow directly interoperates with or utilizes several existing Apache projects.
* Build
** Apache Maven
* Data I/O, Libraries
** Apache Avro
** Apache Commons
* Dataflow runners
** Apache Flink
** Apache Spark
Dataflow when used in batch mode shares similarities with Apache Crunch; however, Dataflow is focused on a model, SDK, and abstraction layer beyond Spark and Hadoop (MapReduce). One key goal of Dataflow is to provide an intermediate abstraction layer which can easily be implemented and utilized across several different processing frameworks.
=== An excessive fascination with the Apache brand ===
With this proposal we are not seeking attention or publicity. Rather, we firmly believe in the Dataflow model, SDK, and the ability to make Dataflow a powerful yet simple framework for data processing. While the Dataflow SDK and model have been open source, we believe putting code on GitHub can only go so far. We see the Apache community, processes, and mission as critical for ensuring the Dataflow SDK and model are truly community-driven, positively impactful, and innovative open source software. While Google has taken a number of steps to advance its various open source projects, we believe Dataflow is a great fit for the Apache Software Foundation due to its focus on data processing and its relationships to existing ASF projects.
== Documentation ==
The following documentation is relevant to this proposal. Relevant portions of the documentation will be contributed to the Apache Dataflow project.
* Dataflow website: https://cloud.google.com/dataflow
* Dataflow programming model: https://cloud.google.com/dataflow/model/programming-model
* Codebases
** Dataflow Java SDK: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
** Flink Dataflow runner: https://github.com/dataArtisans/flink-dataflow
** Spark Dataflow runner: https://github.com/cloudera/spark-dataflow
* Dataflow Java SDK issue tracker: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues
* google-cloud-dataflow tag on Stack Overflow: http://stackoverflow.com/questions/tagged/google-cloud-dataflow
== Initial Source ==
The initial source for Dataflow which we will submit to the Apache Foundation will include several related projects which are currently hosted in the following GitHub repositories:
* Dataflow Java SDK (https://github.com/GoogleCloudPlatform/DataflowJavaSDK)
* Flink Dataflow runner (https://github.com/dataArtisans/flink-dataflow)
* Spark Dataflow runner (https://github.com/cloudera/spark-dataflow)
These projects have always been Apache 2.0 licensed. We intend to bundle all of these repositories since they are all complementary and should be maintained in one project. Prior to our submission, we will combine all of these projects into a new git repository.
== Source and Intellectual Property Submission Plan ==
The source for the Dataflow SDK and the three runners (Spark, Flink, Google Cloud Dataflow) is already licensed under an Apache 2 license.
* Dataflow SDK - https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/LICENSE
* Flink runner - https://github.com/dataArtisans/flink-dataflow/blob/master/LICENSE
* Spark runner - https://github.com/cloudera/spark-dataflow/blob/master/LICENSE
Contributors to the Dataflow SDK have also signed the Google Individual Contributor License Agreement (https://cla.developers.google.com/about/google-individual) in order to contribute to the project.
With respect to trademark rights, Google does not hold a trademark on the phrase “Dataflow.” Based on feedback and guidance we receive during the incubation process, we are open to renaming the project if necessary for trademark or other concerns.
== External Dependencies ==
All external dependencies are licensed under an Apache 2.0 or Apache-compatible license. As we grow the Dataflow community we will configure our build process to require and validate that all contributions and dependencies are licensed under the Apache 2.0 license or are under an Apache-compatible license.
== Required Resources ==
=== Mailing Lists ===
We currently use a mix of mailing lists. We will migrate our existing mailing lists to the following:
* d...@dataflow.incubator.apache.org
* u...@dataflow.incubator.apache.org
* priv...@dataflow.incubator.apache.org
* comm...@dataflow.incubator.apache.org
=== Source Control ===
The Dataflow team currently uses Git and would like to continue to do so. We request a Git repository for Dataflow with mirroring to GitHub enabled.
=== Issue Tracking ===
We request the creation of an Apache-hosted JIRA. The Dataflow project is currently using both a public GitHub issue tracker and internal Google issue tracking. We will migrate and combine issues from these two sources into the Apache JIRA.
== Initial Committers ==
* Aljoscha Krettek [aljos...@apache.org]
* Amit Sela [amitsel...@gmail.com]
* Ben Chambers [bchamb...@google.com]
* Craig Chambers [chamb...@google.com]
* Dan Halperin [dhalp...@google.com]
* Davor Bonaci [da...@google.com]
* Frances Perry [f...@google.com]
* James Malone [jamesmal...@google.com]
* Jean-Baptiste Onofré [jbono...@apache.org]
* Josh Wills [jwi...@apache.org]
* Kostas Tzoumas [kos...@data-artisans.com]
* Kenneth Knowles [k...@google.com]
* Luke Cwik [lc...@google.com]
* Maximilian Michels [m...@apache.org]
* Stephan Ewen [step...@data-artisans.com]
* Tom White [t...@cloudera.com]
* Tyler Akidau [taki...@google.com]
== Affiliations ==
The initial committers are from six organizations. Google developed Dataflow and the Dataflow SDK, data Artisans developed the Flink runner, and Cloudera (Labs) developed the Spark runner.
* Cloudera
** Tom White
* data Artisans
** Aljoscha Krettek
** Kostas Tzoumas
** Maximilian Michels
** Stephan Ewen
* Google
** Ben Chambers
** Dan Halperin
** Davor Bonaci
** Frances Perry
** James Malone
** Kenneth Knowles
** Luke Cwik
** Tyler Akidau
* PayPal
** Amit Sela
* Slack
** Josh Wills
* Talend
** Jean-Baptiste Onofré
== Sponsors ==
=== Champion ===
* Jean-Baptiste Onofré [jbono...@apache.org]
=== Nominated Mentors ===
* Jim Jagielski [j...@apache.org]
* Venkatesh Seetharam [venkat...@apache.org]
* Bertrand Delacretaz [bdelacre...@apache.org]
* Ted Dunning [tdunn...@apache.org]
=== Sponsoring Entity ===
The Apache Incubator
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org
--
Thanks and Regards,
Mayank
Cell: 408-718-9370