[VOTE] Accept Wayang into the Apache Incubator

Christofer Dutz Fri, 11 Dec 2020 08:33:41 -0800

Hi all,

Following up on the [DISCUSS] thread on Wayang
(https://lists.apache.org/thread.html/r5fc03ae014f44c7c31a509a6db4ac07faedb2e1c6245cd917b744826%40%3Cgeneral.incubator.apache.org%3E),
I would like to call a VOTE to accept Wayang (aka Rheem) into the Apache
Incubator.


Please cast your vote:

  [ ] +1, bring Wayang into the Incubator
  [ ] +0, I don't care either way
  [ ] -1, do not bring Wayang into the Incubator, because...

The vote will be open for at least 72 hours. Only votes from the Incubator PMC
are binding, but votes from everyone are welcome.

Chris

-----

Wayang Proposal 
(https://cwiki.apache.org/confluence/display/INCUBATOR/WayangProposal)

== Abstract ==

Wayang is a cross-platform data processing system that aims at decoupling the 
business logic of data analytics applications from concrete data processing 
platforms, such as Apache Flink or Apache Spark. Hence, it tames the complexity 
that arises from the "Cambrian explosion" of novel data processing platforms 
that we currently witness.

Note that the Wayang project is the former Rheem project; we have renamed the
project because of trademark issues.

You can find the project web page at: https://rheem-ecosystem.github.io/

= Proposal =

Wayang is a cross-platform system that provides an abstraction over data 
processing platforms to free users from the burdens of (i) performing tedious 
and costly data migration and integration tasks to run their applications, and 
(ii) choosing the right data processing platforms for their applications. To 
achieve this, Wayang: (1) provides an abstraction on top of existing data
processing platforms that allows users to specify their data analytics tasks in
the form of a DAG of operators; (2) comes with a cross-platform optimizer for
automating the selection of suitable/efficient platforms; and (3) takes care of
executing the optimized plan, including communication across platforms. In
summary, Wayang has the following salient features:

- Flexible Data Model - It considers a flexible and simple data model based on 
data quanta. A data quantum is an atomic processing unit in the system that
can represent a large spectrum of data formats, such as data points for a 
machine learning application, tuples for a database application, or RDF 
triples. Hence, Wayang is able to express a wide range of data analytics tasks.
- Platform independence - It provides a simple interface (currently Java and 
Scala) that is inspired by established programming models, such as that of 
Apache Spark and Apache Flink. Users represent their data analytic tasks as a 
DAG (Wayang plan), where vertices correspond to Wayang operators and edges 
represent data flows (data quanta flowing) among these operators. A Wayang 
operator defines a particular kind of data transformation over an input data 
quantum, ranging from basic functionality (e.g., transformations, filters, 
joins) to complex, extensible tasks (e.g., PageRank).
- Cross-platform execution - Besides running a data analytic task on any data 
processing platform, it also comes with an optimizer that can decide to execute 
a single data analytic task using multiple data processing platforms. This 
allows for exploiting the capabilities of different data processing platforms 
to perform complex data analytic tasks more efficiently.
- Self-tuning UDF-based cost model - Its optimizer uses a cost model fully based
on UDFs. This not only enables Wayang to learn the cost functions of newly 
added data processing platforms, but also allows developers to tune the 
optimizer at will.
- Extensibility - It treats data processing platforms as plugins to allow users 
(developers) to easily incorporate new data processing platforms into the 
system. This is achieved by exposing the functionalities of data processing 
platforms as operators (execution operators). The same approach is followed at 
the Wayang interface, where users can also extend Wayang capabilities, i.e., 
the operators, easily.
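
To make the features above more concrete, the following is a minimal sketch of how a plan expressed as a DAG of operators could be mapped to platforms by a cost model built from UDFs. It is not Wayang's API: the names Operator, topological_order, choose_platforms, and move_cost, the two cost functions, and the greedy per-operator strategy are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: a cost UDF maps an input cardinality to an estimated cost.
CostUdf = Callable[[int], float]


@dataclass
class Operator:
    """One vertex of the plan DAG (illustrative, not a Wayang class)."""
    name: str                      # assumed unique within a plan
    cardinality: int               # estimated number of input data quanta
    costs: Dict[str, CostUdf]      # platform name -> cost UDF
    inputs: List["Operator"] = field(default_factory=list)


def topological_order(sink: Operator) -> List[Operator]:
    """Return every operator reachable from the sink, inputs first."""
    ordered: List[Operator] = []
    seen = set()

    def visit(op: Operator) -> None:
        if id(op) in seen:
            return
        seen.add(id(op))
        for parent in op.inputs:
            visit(parent)
        ordered.append(op)

    visit(sink)
    return ordered


def choose_platforms(sink: Operator, move_cost: float = 5.0) -> Dict[str, str]:
    """Greedily pick the cheapest platform for each operator.

    An operator pays move_cost for every input that was assigned to a
    different platform, which stands in for cross-platform data movement.
    """
    assignment: Dict[str, str] = {}
    for op in topological_order(sink):
        def total(platform: str) -> float:
            moves = sum(1 for p in op.inputs if assignment[p.name] != platform)
            return op.costs[platform](op.cardinality) + move_cost * moves

        assignment[op.name] = min(op.costs, key=total)
    return assignment


# Made-up cost UDFs: "java" has no start-up overhead but scales worse,
# "spark" pays a fixed start-up cost but is cheaper per data quantum.
costs = {
    "java": lambda n: 0.001 * n,
    "spark": lambda n: 10.0 + 0.0001 * n,
}

source = Operator("source", 1_000_000, costs)
filter_op = Operator("filter", 1_000_000, costs, [source])
reduce_op = Operator("reduce", 100, costs, [filter_op])

assignment = choose_platforms(reduce_op)
print(assignment)  # {'source': 'spark', 'filter': 'spark', 'reduce': 'java'}
```

In this toy plan the two large operators land on one platform and the small final step on another, which is the kind of mixed execution the cross-platform feature describes. A real optimizer would search over whole plans rather than decide greedily per operator.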

We plan to work on the stability of all these features as well as on extending
Wayang with more advanced features. Furthermore, Wayang currently supports
Apache Spark, standalone Java, GraphChi, and relational databases (via JDBC). We
plan to incorporate more data processing platforms, such as Apache Flink and
Apache Hive.

=== Background ===

Many organizations and companies collect or produce a large variety of data and
apply data analytics to it, because insights from data allow them to make
better decisions quickly. The pursuit of efficient and scalable data analytics,
together with the one-size-does-not-fit-all philosophy, has given rise to a
plethora of data processing platforms. Examples of these specialized processing
platforms range from DBMSs to MapReduce-like platforms.

However, today's data analytics are moving beyond the limits of a single data
processing platform. More and more applications need to perform complex data
analytics over several data processing platforms. For example, (i) IBM reported
that North York hospital needs to process 50 diverse datasets, which are on a
dozen different internal systems, (ii) oil & gas companies stated they need to
process large amounts of data they produce every day, e.g., a single oil company
can produce more than 1.5TB of diverse (structured and unstructured) data per
day, (iii) Fortune magazine stated that airlines need to analyze large
datasets, which are produced by different departments, are of different data
formats, and reside on multiple data sources, to produce global reports for
decision makers, and (iv) Hewlett Packard has claimed that, according to its
customer portfolio, business intelligence typically requires a single analytics
pipeline using different processing platforms at different parts of the
pipeline. These are just a few examples of emerging applications that require a
diversity of data processing platforms.

Today, developers have to deal with this myriad of data processing platforms.
That is, they have to choose the right data processing platform for their
applications (or data analytic tasks) and familiarize themselves with the
intricacies of the different platforms to achieve high efficiency and
scalability. Several systems have also appeared with the goal of helping users
to easily glue several platforms together, such as Apache Drill, PrestoDB, and
Luigi. Nevertheless, all these systems still require considerable expertise
from users to decide which data processing platforms to use for the data
analytic task at hand. As a consequence, great engineering effort is required
to unify the data from various sources, to combine the processing capabilities
of different platforms, and to maintain those applications, so as to unleash
the full potential of the data. In the worst case, such applications are not
built in the first place, as doing so seems too daunting an endeavor.

=== Rationale ===

It is evident that there is an urgent need to release developers from the
burden of knowing all the intricacies of choosing and gluing together data
processing platforms for supporting their applications (data analytic tasks).
Developers should be able to focus only on the logic of their applications.
Surprisingly, there is no open source system trying to satisfy this need.
Wayang aims at filling this gap by providing both a common interface over data
processing platforms and an optimizer to execute data analytic tasks on the
right data processing platform(s) seamlessly. As Apache is home to most of the
important big data systems, we consider Apache the right place for Wayang.

=== Current Status ===

The current version of Wayang (v0.5.0) was initially co-developed by staff,
students, and interns at the Qatar Computing Research Institute (QCRI) and the
Hasso-Plattner Institute (HPI). The project was initiated at and sponsored by
QCRI in 2015 with the goal of freeing data scientists and developers from the
intricacies of data processing platforms to support their analytic tasks. The
first open source release of Wayang was made only a year and a half later, on
June 13th, 2016, under the Apache License 2.0. Since then, we have made
several releases; the latest was made on January 23rd, 2019.

** Meritocracy **

All current Wayang developers are familiar with the development process at
Apache and are trying to follow its meritocratic process as much as possible.
For example, Wayang already follows a committer principle where every pull
request is reviewed by at least one Wayang core developer. This was one of the
reasons for choosing Apache for Wayang, as we all want to encourage and keep
this style of development.

** Community **

Wayang started as a pure research project, but it quickly started developing 
into a community. People from HPI quickly joined our efforts almost from the 
very beginning to make this project a reality. Recently, the Berlin Institute 
of Technology (TU Berlin) and the Pontifical Catholic University of Valparaiso 
(PUCV) in Chile have also joined our efforts for developing Wayang. A company, 
called Scalytics, has been created around Wayang. Currently, we are intensively 
seeking to further develop both developer and user communities. To keep 
broadening the community, we plan to also exploit our ongoing academic 
collaborations with multiple universities in Berlin and companies that we 
collaborate with. For instance, Wayang is already being utilized for accessing 
multiple data sources in the context of a large data analytics project led by 
TU Berlin and Huawei. We also believe that Wayang's extensible architecture 
(i.e., adding new operators and platforms) will further encourage community 
participation. During incubation we plan to have Wayang adopted by at least one 
company and will explicitly seek more industrial participation.

** Core Developers **

The initial developers of the project are diverse: they come from four
different institutions (TU Berlin, Scalytics, PUCV, and HBKU). We will work
aggressively to grow the community during incubation by recruiting more
developers from other institutions.

** Alignment **

We believe Apache is the most natural home for taking Wayang to the next level. 
Apache is currently hosting the most important big data systems. Hadoop, Spark, 
Flink, HBase, Hive, Tez, Reef, Storm, Drill, and Ignite are just some examples 
of these technologies. Wayang fills a significant gap that exists in the big
data open source world: it provides a common abstraction for all these
platforms and decides on which platforms to run a single data analytic task.
Wayang is now being developed following the Apache-style development model. 
Also, it is well-aligned with the Apache principle of building a community to 
impact the big data community.

=== Known Risks ===

** Orphaned Products **

Currently, Wayang is the core technology behind Scalytics Inc. As a result, a
team of two engineers is working full time on this project. Recently, three
more developers have joined our efforts in building Wayang. Thus, the risk of
Wayang becoming orphaned is very low. Moreover, people outside Scalytics (from
TU Berlin and HBKU) have also joined the project, which makes the risk of
abandoning the project even lower. The PUCV in Chile is also beginning to
contribute to the code base and to develop a declarative query language on top
of Wayang. The project is constantly monitored by email and frequent Skype
meetings as well as by weekly meetings with Scalytics people. Additionally, at
the end of each year, we meet to discuss the status of the project and to plan
the most important aspects to work on during the following year.

** Inexperience with Open Source **

Wayang has been developed as open source under the Apache License 2.0 since
shortly after its inception. The source code is available on GitHub. Also, a
few of the initial committers have contributed to other open source projects:
Hadoop and Flume.

** Homogeneous Developers **

The initial committers are already geographically distributed among Chile, 
Germany, and Qatar. During incubation, one of our main goals is to increase the 
heterogeneity of the current community and we will work hard to achieve it.

** Reliance on salaried developers **

Wayang is already being developed by a mix of full-time and volunteer
developers. Only 2 of the initial committers are working full time on this
project (Scalytics). So, we are confident that the project will not decrease
its development pace. Furthermore, we are committed to recruiting additional
committers to significantly increase the development pace of the project.

** Relationships with other Apache products **

Wayang is somewhat related to Apache Spark, as its development interface is
inspired by Spark. In contrast to Spark, Wayang is not a data processing
platform, but a mediator between user applications and data processing
platforms. In this sense, Wayang is similar to the Apache Drill and Apache Beam
projects. However, Wayang significantly differs from Apache Drill in two
main aspects. First, Apache Drill provides only a common interface to query
multiple data stores and hence users have to specify in their query the data
to fetch. Then, Apache Drill translates the query to the processing platforms
where the data is stored, e.g., into the MongoDB query representation. In
contrast, in Wayang, users only specify the data path and Wayang decides which
are the best (performance-wise) data processing platforms to use to process
such data. Second, the query interface in Apache Drill is SQL, whereas Wayang
uses an interface based on operators forming DAGs. On this latter point, we are
currently developing a Pig Latin-like query language for Wayang. In addition,
in contrast to Apache Beam, Wayang not only allows users to use multiple data
processing platforms at the same time, but also provides an optimizer to choose
the most efficient platform for the task at hand. In Apache Beam, users have to
specify an appropriate runner (platform).
Given these similarities with the two Apache projects mentioned above, we are
looking forward to collaborating with those communities. We are also open to,
and would love, collaborating with other Apache communities.

** An excessive fascination with the Apache Brand **

Wayang solves a real problem that users and developers currently have to deal
with at a high cost: monetary cost, high design and development effort, and
long development time. Therefore, we believe that Wayang can be successful in
building a large community around it. We are convinced that the Apache brand
and community process will significantly help us build such a community and
establish the project in the long term. We simply believe that the ASF is the
right home for Wayang to achieve this.

=== Documentation ===

Further details, documentation, and publications related to Wayang can be found 
at https://docs.rheem.io/rheem/

=== Initial Source ===

The current source code of Wayang resides on GitHub:
https://github.com/rheem-ecosystem/rheem

=== External Dependencies ===

Wayang depends on the following Apache projects:

* Maven
* HDFS
* Hadoop
* Spark

Wayang depends on the following other open source projects organized by license:

org.json.json: Json (http://json.org/license.html) 
SnakeYAML: Apache 2.0
Java Unified Expression Language API (Juel): Apache 2.0
ProfileDB Instrumentation: Apache 2.0
Gson: Apache 2.0
Hadoop: Apache 2.0
Scala: Apache 2.0
Antlr 4: BSD
Jackson: Apache 2.0
Junit 5: EPL 2.0
Mockito: MIT
Assertj: Apache 2.0
logback-classic: EPL 1.0 / LGPL 2.1
slf4j: MIT
GNU Trove: LGPL 2.1
graphchi: Apache 2.0
SQLite JDBC: Apache 2.0
PostgreSQL: BSD 2-clause
jcommander: Apache 2.0
Koloboke Collections API: Apache 2.0
Snappy Java: Apache 2.0
Apache Spark: Apache 2.0
HyperSQL Database: BSD Modified (http://hsqldb.org/web/hsqlLicense.html) 
Apache Giraph: Apache 2.0
Apache Flink: Apache 2.0
Apache Commons IO: Apache 2.0
Apache Commons Lang: Apache 2.0
Apache Maven: Apache 2.0

=== Cryptography ===

(not applicable)

=== Required Resources ===

** Mailing Lists **

* mailto:priv...@wayang.incubator.apache.org
* mailto:d...@wayang.incubator.apache.org
* mailto:comm...@wayang.incubator.apache.org

** Git repositories **

git://git.apache.org/repos/asf/incubator/wayang

** Issue tracking **

https://issues.apache.org/jira/browse/RHEEM

=== Initial Committers ===

The following list gives the planned initial committers (in alphabetical order):

* Bertty Contreras-Rojas <bertty@scalytics.io>
* Rodrigo Pardo-Meza <rodrigo@scalytics.io>
* Alexander Alten-Lorenz <alo@scalytics.io>
* Zoi Kaoudi <zoi.kaoudi@tu-berlin.de>
* Haralampos Gavriilidis <gavriilidis@tu-berlin.de>
* Jorge-Arnulfo Quiane-Ruiz <jorge.quiane@tu-berlin.de>
* Anis Troudi <atroudi@hbku.edu.qa>
* Wenceslao Palma-Muñoz <wenceslao.palma@pucv.cl>

** Affiliations **

* Scalytics Inc.
** Bertty Contreras-Rojas
** Rodrigo Pardo-Meza
** Alexander Alten-Lorenz
* Berlin Institute of Technology (TU Berlin)
** Zoi Kaoudi
** Haralampos Gavriilidis
** Jorge-Arnulfo Quiane-Ruiz
* Hamad Bin Khalifa University (HBKU)
** Anis Troudi
* Pontifical Catholic University of Valparaiso, Chile (PUCV)
** Wenceslao Palma-Muñoz

=== Sponsors ===

** Champion **

* Christofer Dutz (christofer.dutz at c-ware dot de)

** Mentors **

* (cdutz) Christofer Dutz
* (larsgeorge) Lars George
* (berndf) Fondermann
* (jbonofre) Jean-Baptiste Onofré

** Sponsoring Entity **

The Apache Incubator
