Hi, +1 (binding)
Kind Regards,
Furkan KAMACI

On 11 Dec 2020 Fri at 20:04 Daniel B. Widdis <wid...@gmail.com> wrote:

> +1 (non-binding). I'm interested in getting involved in this project!
>
> On Fri, Dec 11, 2020 at 8:33 AM Christofer Dutz <christofer.d...@c-ware.de> wrote:
>
> > Hi all,
> >
> > following up on the [DISCUSS] thread on Wayang (https://lists.apache.org/thread.html/r5fc03ae014f44c7c31a509a6db4ac07faedb2e1c6245cd917b744826%40%3Cgeneral.incubator.apache.org%3E) I would like to call a VOTE to accept Wayang (aka Rheem) into the Apache Incubator.
> >
> > Please cast your vote:
> >
> > [ ] +1, bring Wayang into the Incubator
> > [ ] +0, I don't care either way
> > [ ] -1, do not bring Wayang into the Incubator, because...
> >
> > The vote will be open for at least 72 hours and only votes from the Incubator PMC are binding, but votes from everyone are welcome.
> >
> > Chris
> >
> > -----
> >
> > Wayang Proposal (https://cwiki.apache.org/confluence/display/INCUBATOR/WayangProposal)
> >
> > == Abstract ==
> >
> > Wayang is a cross-platform data processing system that aims at decoupling the business logic of data analytics applications from concrete data processing platforms, such as Apache Flink or Apache Spark. Hence, it tames the complexity that arises from the "Cambrian explosion" of novel data processing platforms that we currently witness.
> >
> > Note that the Wayang project is the Rheem project; we have renamed the project because of trademark issues.
> >
> > You can find the project web page at: https://rheem-ecosystem.github.io/
> >
> > = Proposal =
> >
> > Wayang is a cross-platform system that provides an abstraction over data processing platforms to free users from the burdens of (i) performing tedious and costly data migration and integration tasks to run their applications, and (ii) choosing the right data processing platforms for their applications.
> > To achieve this, Wayang: (1) provides an abstraction on top of existing data processing platforms that allows users to specify their data analytics tasks in the form of a DAG of operators; (2) comes with a cross-platform optimizer for automating the selection of suitable/efficient platforms; and (3) takes care of executing the optimized plan, including communication across platforms. In summary, Wayang has the following salient features:
> >
> > - Flexible Data Model - It considers a flexible and simple data model based on data quanta. A data quantum is an atomic processing unit in the system that can represent a large spectrum of data formats, such as data points for a machine learning application, tuples for a database application, or RDF triples. Hence, Wayang is able to express a wide range of data analytics tasks.
> > - Platform independence - It provides a simple interface (currently Java and Scala) that is inspired by established programming models, such as those of Apache Spark and Apache Flink. Users represent their data analytic tasks as a DAG (Wayang plan), where vertices correspond to Wayang operators and edges represent data flows (data quanta flowing) among these operators. A Wayang operator defines a particular kind of data transformation over an input data quantum, ranging from basic functionality (e.g., transformations, filters, joins) to complex, extensible tasks (e.g., PageRank).
> > - Cross-platform execution - Besides running a data analytic task on any data processing platform, it also comes with an optimizer that can decide to execute a single data analytic task using multiple data processing platforms. This allows for exploiting the capabilities of different data processing platforms to perform complex data analytic tasks more efficiently.
> > - Self-tuning UDF-based cost model - Its optimizer uses a cost model fully based on UDFs. This not only enables Wayang to learn the cost functions of newly added data processing platforms, but also allows developers to tune the optimizer at will.
> > - Extensibility - It treats data processing platforms as plugins to allow users (developers) to easily incorporate new data processing platforms into the system. This is achieved by exposing the functionalities of data processing platforms as operators (execution operators). The same approach is followed at the Wayang interface, where users can also easily extend Wayang's capabilities, i.e., the operators.
> >
> > We plan to work on the stability of all these features as well as on extending Wayang with more advanced features. Furthermore, Wayang currently supports Apache Spark, Standalone Java, GraphChi, and relational databases (via JDBC). We plan to incorporate more data processing platforms, such as Apache Flink and Apache Hive.
> >
> > === Background ===
> >
> > Many organizations and companies collect or produce a large variety of data to apply data analytics over them. This is because insights from data rapidly allow them to make better decisions. Thus, the pursuit of efficient and scalable data analytics as well as the one-size-does-not-fit-all philosophy has given rise to a plethora of data processing platforms. Examples of these specialized processing platforms range from DBMSs to MapReduce-like platforms.
> >
> > However, today's data analytics are moving beyond the limits of a single data processing platform. More and more applications need to perform complex data analytics over several data processing platforms.
> > For example, (i) IBM reported that North York hospital needs to process 50 diverse datasets, which are on a dozen different internal systems, (ii) oil & gas companies stated they need to process large amounts of data they produce every day, e.g., a single oil company can produce more than 1.5TB of diverse (structured and unstructured) data per day, (iii) Fortune magazine stated that airlines need to analyze large datasets, which are produced by different departments, are of different data formats, and reside on multiple data sources, to produce global reports for decision makers, and (iv) Hewlett Packard has claimed that, according to its customer portfolio, business intelligence typically requires a single analytics pipeline using different processing platforms at different parts of the pipeline. These are just a few examples of emerging applications that require a diversity of data processing platforms.
> >
> > Today, developers have to deal with this myriad of data processing platforms. That is, they have to choose the right data processing platform for their applications (or data analytic tasks) and to familiarize themselves with the intricacies of the different platforms to achieve high efficiency and scalability. Several systems have also appeared with the goal of helping users to easily glue several platforms together, such as Apache Drill, PrestoDB, and Luigi. Nevertheless, all these systems still require considerable expertise from users to decide which data processing platforms to use for the data analytic task at hand. In consequence, great engineering effort is required to unify the data from various sources, to combine the processing capabilities of different platforms, and to maintain those applications, so as to unleash the full potential of the data.
> > In the worst case, such applications are not built in the first place, as it seems too daunting an endeavor.
> >
> > === Rationale ===
> >
> > It is evident that there is an urgent need to release developers from the burden of knowing all the intricacies of choosing and gluing together data processing platforms for supporting their applications (data analytic tasks). Developers must focus only on the logic of their applications. Surprisingly, there is no open source system trying to satisfy this urgent need. Wayang aims at filling this gap. It copes with this urgent need by providing both a common interface over data processing platforms and an optimizer to execute data analytic tasks on the right data processing platform(s) seamlessly. As Apache is the place where most of the important big data systems are, we consider Apache the right place for Wayang.
> >
> > === Current Status ===
> >
> > The current version of Wayang (v0.5.0) was initially co-developed by staff, students, and interns at the Qatar Computing Research Institute (QCRI) and the Hasso-Plattner Institute (HPI). The project was initiated at and sponsored by QCRI in 2015 with the goal of freeing data scientists and developers from the intricacies of data processing platforms to support their analytic tasks. The first open source release of Wayang was made only a year and a half later, on June 13th, 2016, under the Apache Software License 2.0. Since then we have made several releases; the latest release was made on January 23rd, 2019.
> >
> > ** Meritocracy **
> >
> > All current Wayang developers are familiar with the development process at Apache and are currently trying to follow this meritocracy process as much as possible. For example, Wayang already follows a committer principle where any pull request is analyzed by at least one Wayang core developer.
> > This was one of the reasons for choosing Apache for Wayang as we all want to encourage and keep this style of development for Wayang.
> >
> > ** Community **
> >
> > Wayang started as a pure research project, but it quickly started developing into a community. People from HPI joined our efforts almost from the very beginning to make this project a reality. Recently, the Berlin Institute of Technology (TU Berlin) and the Pontifical Catholic University of Valparaiso (PUCV) in Chile have also joined our efforts for developing Wayang. A company, called Scalytics, has been created around Wayang. Currently, we are intensively seeking to further develop both developer and user communities. To keep broadening the community, we also plan to exploit our ongoing academic collaborations with multiple universities in Berlin and companies that we collaborate with. For instance, Wayang is already being utilized for accessing multiple data sources in the context of a large data analytics project led by TU Berlin and Huawei. We also believe that Wayang's extensible architecture (i.e., adding new operators and platforms) will further encourage community participation. During incubation we plan to have Wayang adopted by at least one company and will explicitly seek more industrial participation.
> >
> > ** Core Developers **
> >
> > The initial developers of the project are diverse: they come from four different institutions (TU Berlin, Scalytics, PUCV, and HBKU). We will work aggressively to grow the community during the incubation by recruiting more developers from other institutions.
> >
> > ** Alignment **
> >
> > We believe Apache is the most natural home for taking Wayang to the next level. Apache is currently hosting the most important big data systems. Hadoop, Spark, Flink, HBase, Hive, Tez, Reef, Storm, Drill, and Ignite are just some examples of these technologies.
> > Wayang fills a significant gap that exists in the big data open source world: it provides a common abstraction for all these platforms and decides on which platforms to run a single data analytic task. Wayang is now being developed following the Apache-style development model. Also, it is well-aligned with the Apache principle of building a community to impact the big data community.
> >
> > === Known Risks ===
> >
> > ** Orphaned Products **
> >
> > Currently, Wayang is the core technology behind Scalytics Inc. As a result, a team of two engineers is working full time on this project. Recently, three more developers have joined our efforts in building Wayang. Thus, the risk of Wayang becoming orphaned is very low. In addition, people outside Scalytics (from TU Berlin and HBKU) have also joined the project, which makes the risk of abandoning the project even lower. The PUCV in Chile is also beginning to contribute to the code base and to develop a declarative query language on top of Wayang. The project is constantly being monitored by email and frequent Skype meetings as well as by weekly meetings with Scalytics people. Additionally, at the end of each year, we meet to discuss the status of the project as well as to plan the most important aspects we should work on during the following year.
> >
> > ** Inexperience with Open Source **
> >
> > Wayang quickly started being developed in open source under the Apache Software License 2.0. The source code is available on GitHub. Also, a few of the initial committers have contributed to other open source projects: Hadoop and Flume.
> >
> > ** Homogeneous Developers **
> >
> > The initial committers are already geographically distributed among Chile, Germany, and Qatar. During incubation, one of our main goals is to increase the heterogeneity of the current community and we will work hard to achieve it.
> >
> > ** Reliance on salaried developers **
> >
> > Wayang is already being developed by a mix of full-time and volunteer contributors. Only 2 of the initial committers are working full time on this project (Scalytics). So, we are confident that the project will not decrease its development pace. Furthermore, we are committed to recruiting additional committers to significantly increase the development pace of the project.
> >
> > ** Relationships with other Apache products **
> >
> > Wayang is somewhat related to Apache Spark as its development interface is inspired by Spark. In contrast to Spark, Wayang is not a data processing platform, but a mediator between user applications and data processing platforms. In this sense, Wayang is similar to the Apache Drill project and Apache Beam. However, Wayang significantly differs from Apache Drill in two main aspects. First, Apache Drill provides only a common interface to query multiple data stores and hence users have to specify in their query the data to fetch. Then, Apache Drill translates the query to the processing platforms where the data is stored, e.g., into the MongoDB query representation. In contrast, in Wayang, users only specify the data path and Wayang decides which are the best (performance-wise) data processing platforms to use to process such data. Second, the query interface in Apache Drill is SQL. Wayang uses an interface based on operators forming DAGs. On this latter point, we are currently developing a Pig Latin-like query language for Wayang. In addition, in contrast to Apache Beam, Wayang not only allows users to use multiple data processing platforms at the same time, but also provides an optimizer to choose the most efficient platform for the task at hand. In Apache Beam, users have to specify an appropriate runner (platform).
> > Given these similarities with the two Apache projects mentioned above, we are looking forward to collaborating with those communities. We are also open to, and would love, collaborating with other Apache communities.
> >
> > ** An excessive fascination with the Apache Brand **
> >
> > Wayang solves a real problem that users and developers currently have to deal with at a high cost: monetary cost, high design and development effort, and a great deal of time. Therefore, we believe that Wayang can be successful in building a large community around it. We are convinced that the Apache brand and community process will significantly help us in building such a community and in establishing the project in the long term. We simply believe that the ASF is the right home for Wayang to achieve this.
> >
> > === Documentation ===
> >
> > Further details, documentation, and publications related to Wayang can be found at https://docs.rheem.io/rheem/
> >
> > === Initial Source ===
> >
> > The current source code of Wayang resides on GitHub: https://github.com/rheem-ecosystem/rheem
> >
> > === External Dependencies ===
> >
> > Wayang depends on the following Apache projects:
> >
> > * Maven
> > * HDFS
> > * Hadoop
> > * Spark
> >
> > Wayang depends on the following other open source projects organized by license:
> >
> > org.json.json: Json (http://json.org/license.html)
> > SnakeYAML: Apache 2.0
> > Java Unified Expression Language API (Juel): Apache 2.0
> > ProfileDB Instrumentation: Apache 2.0
> > Gson: Apache 2.0
> > Hadoop: Apache 2.0
> > Scala: Apache 2.0
> > Antlr 4: BSD
> > Jackson: Apache 2.0
> > Junit 5: EPL 2.0
> > Mockito: MIT
> > Assertj: Apache 2.0
> > logback-classic: EPL 1.0 LGPL 2.1
> > slf4j: MIT
> > GNU Trove: LGPL 2.1
> > graphchi: Apache 2.0
> > SQLite JDBC: Apache 2.0
> > PostgreSQL: BSD 2-clause
> > jcommander: Apache 2.0
> > Koloboke Collections API: Apache 2.0
> > Snappy Java: Apache 2.0
> > Apache Spark: Apache 2.0
> > HyperSQL Database: BSD Modified (http://hsqldb.org/web/hsqlLicense.html)
> > Apache Giraph: Apache 2.0
> > Apache Flink: Apache 2.0
> > Apache Commons IO: Apache 2.0
> > Apache Commons Lang: Apache 2.0
> > Apache Maven: Apache 2.0
> >
> > === Cryptography ===
> >
> > (not applicable)
> >
> > === Required Resources ===
> >
> > ** Mailing Lists **
> >
> > * mailto:priv...@wayang.incubator.apache.org
> > * mailto:d...@wayang.incubator.apache.org
> > * mailto:comm...@wayang.incubator.apache.org
> >
> > ** Git repositories **
> >
> > git://git.apache.org/repos/asf/incubator/wayang
> >
> > ** Issue tracking **
> >
> > https://issues.apache.org/jira/browse/RHEEM
> >
> > === Initial Committers ===
> >
> > The following list gives the planned initial committers:
> >
> > * Bertty Contreras-Rojas <bertty@scalytics.io>
> > * Rodrigo Pardo-Meza <rodrigo@scalytics.io>
> > * Alexander Alten-Lorenz <alo@scalytics.io>
> > * Zoi Kaoudi <zoi.kaoudi@tu-berlin.de>
> > * Haralampos Gavriilidis <gavriilidis@tu-berlin.de>
> > * Jorge-Arnulfo Quiane-Ruiz <jorge.quiane@tu-berlin.de>
> > * Anis Troudi <atroudi@hbku.edu.qa>
> > * Wenceslao Palma-Muñoz <wenceslao.palma@pucv.cl>
> >
> > ** Affiliations **
> >
> > * Scalytics Inc.
> > ** Bertty Contreras-Rojas
> > ** Rodrigo Pardo-Meza
> > ** Alexander Alten-Lorenz
> > * Berlin Institute of Technology (TU Berlin)
> > ** Zoi Kaoudi
> > ** Haralampos Gavriilidis
> > ** Jorge-Arnulfo Quiane-Ruiz
> > * Hamad Bin Khalifa University (HBKU)
> > ** Anis Troudi
> > * Pontifical Catholic University of Valparaiso, Chile (PUCV)
> > ** Wenceslao Palma-Muñoz
> >
> > === Sponsors ===
> >
> > ** Champion **
> >
> > * Christofer Dutz (christofer.dutz at c-ware dot de)
> >
> > ** Mentors **
> >
> > . (cdutz) Christofer Dutz
> > . (larsgeorge) Lars George
> > . (berndf) Fondermann
> > . (jbonofre) Jean-Baptiste Onofré
> >
> > ** Sponsoring Entity **
> >
> > The Apache Incubator
>
> --
> Dan Widdis