+1 (binding)
> On Dec 11, 2020, at 8:33 AM, Christofer Dutz <christofer.d...@c-ware.de> wrote:
>
> Hi all,
>
> following up on the [DISCUSS] thread on Wayang
> (https://lists.apache.org/thread.html/r5fc03ae014f44c7c31a509a6db4ac07faedb2e1c6245cd917b744826%40%3Cgeneral.incubator.apache.org%3E)
> I would like to call a VOTE to accept Wayang, aka Rheem, into the Apache
> Incubator.
>
> Please cast your vote:
>
> [ ] +1, bring Wayang into the Incubator
> [ ] +0, I don't care either way
> [ ] -1, do not bring Wayang into the Incubator, because...
>
> The vote will be open for at least 72 hours. Only votes from the Incubator
> PMC are binding, but votes from everyone are welcome.
>
> Chris
>
> -----
>
> Wayang Proposal
> (https://cwiki.apache.org/confluence/display/INCUBATOR/WayangProposal)
>
> == Abstract ==
>
> Wayang is a cross-platform data processing system that aims at decoupling the
> business logic of data analytics applications from concrete data processing
> platforms, such as Apache Flink or Apache Spark. Hence, it tames the
> complexity that arises from the "Cambrian explosion" of novel data processing
> platforms that we currently witness.
>
> Note that the Wayang project is the Rheem project; we have renamed the
> project because of trademark issues.
>
> You can find the project web page at: https://rheem-ecosystem.github.io/
>
> = Proposal =
>
> Wayang is a cross-platform system that provides an abstraction over data
> processing platforms to free users from the burdens of (i) performing tedious
> and costly data migration and integration tasks to run their applications,
> and (ii) choosing the right data processing platforms for their applications.
> To achieve this, Wayang: (1) provides an abstraction on top of existing data
> processing platforms that allows users to specify their data analytics tasks
> in the form of a DAG of operators; (2) comes with a cross-platform optimizer
> for automating the selection of suitable/efficient platforms; and (3)
> finally takes care of executing the optimized plan, including communication
> across platforms. In summary, Wayang has the following salient features:
>
> - Flexible Data Model - It considers a flexible and simple data model based
> on data quanta. A data quantum is an atomic processing unit in the system
> that can represent a large spectrum of data formats, such as data points for
> a machine learning application, tuples for a database application, or RDF
> triples. Hence, Wayang is able to express a wide range of data analytics
> tasks.
> - Platform independence - It provides a simple interface (currently Java and
> Scala) that is inspired by established programming models, such as those of
> Apache Spark and Apache Flink. Users represent their data analytic tasks as a
> DAG (Wayang plan), where vertices correspond to Wayang operators and edges
> represent data flows (data quanta flowing) among these operators. A Wayang
> operator defines a particular kind of data transformation over an input data
> quantum, ranging from basic functionality (e.g., transformations, filters,
> joins) to complex, extensible tasks (e.g., PageRank).
> - Cross-platform execution - Besides running a data analytic task on any data
> processing platform, it also comes with an optimizer that can decide to
> execute a single data analytic task using multiple data processing platforms.
> This allows for exploiting the capabilities of different data processing
> platforms to perform complex data analytic tasks more efficiently.
> - Self-tuning UDF-based cost model - Its optimizer uses a cost model fully
> based on UDFs. This not only enables Wayang to learn the cost functions of
> newly added data processing platforms, but also allows developers to tune the
> optimizer at will.
> - Extensibility - It treats data processing platforms as plugins to allow
> users (developers) to easily incorporate new data processing platforms into
> the system. This is achieved by exposing the functionalities of data
> processing platforms as operators (execution operators). The same approach is
> followed at the Wayang interface, where users can also extend Wayang
> capabilities, i.e., the operators, easily.
>
> We plan to work on the stability of all these features as well as extending
> Wayang with more advanced features. Furthermore, Wayang currently supports
> Apache Spark, Standalone Java, GraphChi, and relational databases (via JDBC).
> We plan to incorporate more data processing platforms, such as Apache Flink
> and Apache Hive.
>
> === Background ===
>
> Many organizations and companies collect or produce a large variety of data
> to apply data analytics over them. This is because insights from data allow
> them to rapidly make better decisions. Thus, the pursuit of efficient and
> scalable data analytics as well as the one-size-does-not-fit-all philosophy
> has given rise to a plethora of data processing platforms. Examples of these
> specialized processing platforms range from DBMSs to MapReduce-like platforms.
>
> However, today's data analytics are moving beyond the limits of a single data
> processing platform. More and more applications need to perform complex data
> analytics over several data processing platforms. For example, (i) IBM
> reported that North York hospital needs to process 50 diverse datasets, which
> are on a dozen different internal systems, (ii) oil & gas companies stated
> they need to process large amounts of data they produce every day, e.g., a
> single oil company can produce more than 1.5TB of diverse (structured and
> unstructured) data per day, (iii) Fortune magazine stated that airlines need
> to analyze large datasets, which are produced by different departments, are
> of different data formats, and reside on multiple data sources, to produce
> global reports for decision makers, and (iv) Hewlett Packard has claimed
> that, according to its customer portfolio, business intelligence typically
> requires a single analytics pipeline using different processing platforms at
> different parts of the pipeline. These are just a few examples of emerging
> applications that require a diversity of data processing platforms.
>
> Today, developers have to deal with this myriad of data processing platforms.
> That is, they have to choose the right data processing platform for their
> applications (or data analytic tasks) and familiarize themselves with the
> intricacies of the different platforms to achieve high efficiency and
> scalability. Several systems have also appeared with the goal of helping
> users to easily glue several platforms together, such as Apache Drill,
> PrestoDB, and Luigi. Nevertheless, all these systems still require
> considerable expertise from users to decide which data processing platforms
> to use for the data analytic task at hand. In consequence, great engineering
> effort is required to unify the data from various sources, to combine the
> processing capabilities of different platforms, and to maintain those
> applications, so as to unleash the full potential of the data. In the worst
> case, such applications are not built in the first place, as it seems too
> much of a daunting endeavor.
>
> === Rationale ===
>
> It is evident that there is an urgent need to release developers from the
> burden of knowing all the intricacies of choosing and gluing together data
> processing platforms for supporting their applications (data analytic tasks).
> Developers must focus only on the logic of their applications. Surprisingly,
> there is no open source system trying to satisfy this need. Wayang aims at
> filling this gap. It does so by providing both a common interface over data
> processing platforms and an optimizer to execute data analytic tasks on the
> right data processing platform(s) seamlessly. As Apache is the place where
> most of the important big data systems are, we consider Apache the right
> place for Wayang.
>
> === Current Status ===
>
> The current version of Wayang (v0.5.0) was initially co-developed by staff,
> students, and interns at the Qatar Computing Research Institute (QCRI) and
> the Hasso-Plattner Institute (HPI). The project was initiated at and
> sponsored by QCRI in 2015 with the goal of freeing data scientists and
> developers from the intricacies of data processing platforms to support their
> analytic tasks. The first open source release of Wayang was made only one
> year and a half later, on June 13th, 2016, under the Apache License 2.0.
> Since then we have made several releases; the latest release was done on
> January 23rd, 2019.
>
> ** Meritocracy **
>
> All current Wayang developers are familiar with the development process at
> Apache and are currently trying to follow its meritocratic process as much as
> possible. For example, Wayang already follows a committer principle where any
> pull request is analyzed by at least one Wayang core developer. This was one
> of the reasons for choosing Apache for Wayang, as we all want to encourage
> and keep this style of development for Wayang.
>
> ** Community **
>
> Wayang started as a pure research project, but it quickly started developing
> into a community. People from HPI joined our efforts almost from the very
> beginning to make this project a reality. Recently, the Berlin Institute of
> Technology (TU Berlin) and the Pontifical Catholic University of Valparaiso
> (PUCV) in Chile have also joined our efforts for developing Wayang. A
> company, called Scalytics, has been created around Wayang. Currently, we are
> intensively seeking to further develop both developer and user communities.
> To keep broadening the community, we also plan to build on our ongoing
> academic collaborations with multiple universities in Berlin and companies
> that we collaborate with. For instance, Wayang is already being utilized for
> accessing multiple data sources in the context of a large data analytics
> project led by TU Berlin and Huawei. We also believe that Wayang's extensible
> architecture (i.e., adding new operators and platforms) will further
> encourage community participation. During incubation we plan to have Wayang
> adopted by at least one company and will explicitly seek more industrial
> participation.
>
> ** Core Developers **
>
> The initial developers of the project are diverse: they come from four
> different institutions (TU Berlin, Scalytics, PUCV, and HBKU). We will work
> aggressively to grow the community during the incubation by recruiting more
> developers from other institutions.
>
> ** Alignment **
>
> We believe Apache is the most natural home for taking Wayang to the next
> level. Apache is currently hosting the most important big data systems.
> Hadoop, Spark, Flink, HBase, Hive, Tez, Reef, Storm, Drill, and Ignite are
> just some examples of these technologies. Wayang fills a significant gap that
> exists in the big data open source world: it provides a common abstraction
> for all these platforms and decides on which platforms to run a single data
> analytic task. Wayang is now being developed following the Apache-style
> development model. Also, it is well-aligned with the Apache principle of
> building a community to impact the big data community.
>
> === Known Risks ===
>
> ** Orphaned Products **
>
> Currently, Wayang is the core technology behind Scalytics Inc. As a result,
> a team of two engineers is working full time on this project. Recently,
> three more developers have joined our efforts in building Wayang. Thus, the
> risk of Wayang becoming orphaned is very low. Moreover, people outside
> Scalytics (from TU Berlin and HBKU) have also joined the project, which makes
> the risk of abandoning the project even lower. The PUCV in Chile is also
> beginning to contribute to the code base and to develop a declarative query
> language on top of Wayang. The project is continuously monitored through
> email and frequent Skype meetings, as well as weekly meetings with the
> Scalytics team. Additionally, at the end of each year, we meet to discuss the
> status of the project as well as to plan the most important aspects we should
> work on during the year after.
>
> ** Inexperience with Open Source **
>
> Wayang moved to open source development early on, under the Apache License
> 2.0. The source code is available on GitHub. Also, a few of the initial
> committers have contributed to other open source projects, namely Hadoop and
> Flume.
>
> ** Homogeneous Developers **
>
> The initial committers are already geographically distributed among Chile,
> Germany, and Qatar. During incubation, one of our main goals is to increase
> the heterogeneity of the current community and we will work hard to achieve
> it.
>
> ** Reliance on salaried developers **
>
> Wayang is already being developed by a mix of full-time and volunteer
> contributors. Only 2 of the initial committers are working full time on this
> project (Scalytics). So, we are confident that the project will not decrease
> its development pace. Furthermore, we are committed to recruiting additional
> committers to significantly increase the development pace of the project.
>
> ** Relationships with other Apache products **
>
> Wayang is related to Apache Spark in that its development interface is
> inspired by Spark. In contrast to Spark, Wayang is not a data processing
> platform, but a mediator between user applications and data processing
> platforms. In this sense, Wayang is similar to the Apache Drill project and
> to Apache Beam. However, Wayang significantly differs from Apache Drill in
> two main aspects. First, Apache Drill provides only a common interface to
> query multiple data stores and hence users have to specify in their query the
> data to fetch. Then, Apache Drill translates the query to the processing
> platforms where the data is stored, e.g. into MongoDB query representation.
> In contrast, in Wayang, users only specify the data path and Wayang decides
> which are the best (performance-wise) data processing platforms to use to
> process such data. Second, the query interface in Apache Drill is SQL. Wayang
> uses an interface based on operators forming DAGs. On this latter point, we
> are currently developing a Pig Latin-like query language for Wayang. In
> addition, in contrast to Apache Beam, Wayang not only allows users to use
> multiple data processing platforms at the same time, but also provides an
> optimizer to choose the most efficient platform for the task at hand. In
> Apache Beam, users have to specify an appropriate runner (platform).
> Given these similarities with the two Apache projects mentioned above, we are
> looking forward to collaborating with those communities. We would also love
> to collaborate with other Apache communities.
>
> ** An excessive fascination with the Apache Brand **
>
> Wayang solves a real problem that users and developers currently have to deal
> with at a high cost: monetary expense, heavy design and development effort,
> and considerable time. Therefore, we believe that Wayang can be successful in
> building a large community around it. We are convinced that the Apache brand
> and community process will significantly help us in building such a community
> and in establishing the project in the long term. We simply believe that ASF
> is the right home for Wayang to achieve this.
>
> === Documentation ===
>
> Further details, documentation, and publications related to Wayang can be
> found at https://docs.rheem.io/rheem/
>
> === Initial Source ===
>
> The current source code of Wayang resides on GitHub:
> https://github.com/rheem-ecosystem/rheem
>
> === External Dependencies ===
>
> Wayang depends on the following Apache projects:
>
> * Maven
> * HDFS
> * Hadoop
> * Spark
>
> Wayang depends on the following other open source projects organized by
> license:
>
> org.json.json: Json (http://json.org/license.html)
> SnakeYAML: Apache 2.0
> Java Unified Expression Language API (Juel): Apache 2.0
> ProfileDB Instrumentation: Apache 2.0
> Gson: Apache 2.0
> Hadoop: Apache 2.0
> Scala: Apache 2.0
> Antlr 4: BSD
> Jackson: Apache 2.0
> Junit 5: EPL 2.0
> Mockito: MIT
> Assertj: Apache 2.0
> logback-classic: EPL 1.0 LGPL 2.1
> slf4j: MIT
> GNU Trove: LGPL 2.1
> graphchi: Apache 2.0
> SQLite JDBC: Apache 2.0
> PostgreSQL: BSD 2-clause
> jcommander: Apache 2.0
> Koloboke Collections API: Apache 2.0
> Snappy Java: Apache 2.0
> Apache Spark: Apache 2.0
> HyperSQL Database: BSD Modified (http://hsqldb.org/web/hsqlLicense.html)
> Apache Giraph: Apache 2.0
> Apache Flink: Apache 2.0
> Apache Commons IO: Apache 2.0
> Apache Commons Lang: Apache 2.0
> Apache Maven: Apache 2.0
>
> === Cryptography ===
>
> (not applicable)
>
> === Required Resources ===
>
> ** Mailing Lists **
>
> * mailto:priv...@wayang.incubator.apache.org
> * mailto:d...@wayang.incubator.apache.org
> * mailto:comm...@wayang.incubator.apache.org
>
> ** Git repositories **
>
> git://git.apache.org/repos/asf/incubator/wayang
>
> ** Issue tracking **
>
> https://issues.apache.org/jira/browse/RHEEM
>
> === Initial Committers ===
>
> The following list gives the planned initial committers (in alphabetical
> order):
>
> * Bertty Contreras-Rojas <bertty@scalytics.io>
> * Rodrigo Pardo-Meza <rodrigo@scalytics.io>
> * Alexander Alten-Lorenz <alo@scalytics.io>
> * Zoi Kaoudi <zoi.kaoudi@tu-berlin.de>
> * Haralampos Gavriilidis <gavriilidis@tu-berlin.de>
> * Jorge-Arnulfo Quiane-Ruiz <jorge.quiane@tu-berlin.de>
> * Anis Troudi <atroudi@hbku.edu.qa>
> * Wenceslao Palma-Muñoz <wenceslao.palma@pucv.cl>
>
> ** Affiliations **
>
> * Scalytics Inc.
> ** Bertty Contreras-Rojas
> ** Rodrigo Pardo-Meza
> ** Alexander Alten-Lorenz
> * Berlin Institute of Technology (TU Berlin)
> ** Zoi Kaoudi
> ** Haralampos Gavriilidis
> ** Jorge-Arnulfo Quiane-Ruiz
> * Hamad Bin Khalifa University (HBKU)
> ** Anis Troudi
> * Pontifical Catholic University of Valparaiso, Chile (PUCV)
> ** Wenceslao Palma-Muñoz
>
> === Sponsors ===
>
> ** Champion **
>
> * Christofer Dutz (christofer.dutz at c-ware dot de)
>
> ** Mentors **
>
> * (cdutz) Christofer Dutz
> * (larsgeorge) Lars George
> * (berndf) Fondermann
> * (jbonofre) Jean-Baptiste Onofré
>
> ** Sponsoring Entity **
>
> The Apache Incubator
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org