Re: [VOTE] Accept Wayang into the Apache Incubator

Furkan KAMACI Fri, 11 Dec 2020 09:35:30 -0800

Hi,

+1 (binding)


Kind Regards,
Furkan KAMACI

On 11 Dec 2020 Fri at 20:04 Daniel B. Widdis <wid...@gmail.com> wrote:

> +1 (non-binding).  I'm interested in getting involved in this project!
>
> On Fri, Dec 11, 2020 at 8:33 AM Christofer Dutz <christofer.d...@c-ware.de> wrote:
>
> > Hi all,
> >
> > following up on the [DISCUSS] thread on Wayang
> > (https://lists.apache.org/thread.html/r5fc03ae014f44c7c31a509a6db4ac07faedb2e1c6245cd917b744826%40%3Cgeneral.incubator.apache.org%3E),
> > I would like to call a VOTE to accept Wayang (aka Rheem) into the Apache
> > Incubator.
> >
> > Please cast your vote:
> >
> >   [ ] +1, bring Wayang into the Incubator
> >   [ ] +0, I don't care either way
> >   [ ] -1, do not bring Wayang into the Incubator, because...
> >
> > The vote will be open for at least 72 hours. Only votes from the Incubator
> > PMC are binding, but votes from everyone are welcome.
> >
> > Chris
> >
> > -----
> >
> > Wayang Proposal (
> > https://cwiki.apache.org/confluence/display/INCUBATOR/WayangProposal)
> >
> > == Abstract ==
> >
> > Wayang is a cross-platform data processing system that aims at decoupling
> > the business logic of data analytics applications from concrete data
> > processing platforms, such as Apache Flink or Apache Spark. Hence, it tames
> > the complexity that arises from the "Cambrian explosion" of novel data
> > processing platforms that we currently witness.
> >
> > Note that the Wayang project is the Rheem project; we have renamed it
> > because of trademark issues.
> >
> > You can find the project web page at: https://rheem-ecosystem.github.io/
> >
> > = Proposal =
> >
> > Wayang is a cross-platform system that provides an abstraction over data
> > processing platforms to free users from the burdens of (i) performing
> > tedious and costly data migration and integration tasks to run their
> > applications, and (ii) choosing the right data processing platforms for
> > their applications. To achieve this, Wayang: (1) provides an abstraction on
> > top of existing data processing platforms that allows users to specify
> > their data analytics tasks in the form of a DAG of operators; (2) comes with
> > a cross-platform optimizer for automating the selection of
> > suitable/efficient platforms; and (3) takes care of executing
> > the optimized plan, including communication across platforms. In summary,
> > Wayang has the following salient features:
> >
> > - Flexible Data Model - It uses a flexible and simple data model
> > based on data quanta. A data quantum is an atomic processing unit in the
> > system that can represent a large spectrum of data formats, such as data
> > points for a machine learning application, tuples for a database
> > application, or RDF triples. Hence, Wayang is able to express a wide range
> > of data analytics tasks.
> > - Platform independence - It provides a simple interface (currently Java
> > and Scala) that is inspired by established programming models, such as that
> > of Apache Spark and Apache Flink. Users represent their data analytic tasks
> > as a DAG (Wayang plan), where vertices correspond to Wayang operators and
> > edges represent data flows (data quanta flowing) among these operators. A
> > Wayang operator defines a particular kind of data transformation over an
> > input data quantum, ranging from basic functionality (e.g.,
> > transformations, filters, joins) to complex, extensible tasks (e.g.,
> > PageRank).
> > - Cross-platform execution - Besides running a data analytic task on any
> > data processing platform, it also comes with an optimizer that can decide
> > to execute a single data analytic task using multiple data processing
> > platforms. This allows for exploiting the capabilities of different data
> > processing platforms to perform complex data analytic tasks more
> > efficiently.
> > - Self-tuning UDF-based cost model - Its optimizer uses a cost model fully
> > based on UDFs. This not only enables Wayang to learn the cost functions of
> > newly added data processing platforms, but also allows developers to tune
> > the optimizer at will.
> > - Extensibility - It treats data processing platforms as plugins to allow
> > users (developers) to easily incorporate new data processing platforms into
> > the system. This is achieved by exposing the functionalities of data
> > processing platforms as operators (execution operators). The same approach
> > is followed at the Wayang interface, where users can also extend Wayang
> > capabilities, i.e., the operators, easily.
> >
> > We plan to work on the stability of all these features as well as on
> > extending Wayang with more advanced features. Furthermore, Wayang currently
> > supports Apache Spark, Standalone Java, GraphChi, and relational databases
> > (via JDBC). We plan to incorporate more data processing platforms, such as
> > Apache Flink and Apache Hive.
> >
> > === Background ===
> >
> > Many organizations and companies collect or produce a large variety of data
> > to apply data analytics over it. This is because insights from data
> > allow them to rapidly make better decisions. Thus, the pursuit of
> > efficient and scalable data analytics as well as the
> > one-size-does-not-fit-all philosophy has given rise to a plethora of data
> > processing platforms. Examples of these specialized processing platforms
> > range from DBMSs to MapReduce-like platforms.
> >
> > However, today's data analytics are moving beyond the limits of a single
> > data processing platform. More and more applications need to perform
> > complex data analytics over several data processing platforms. For example,
> > (i) IBM reported that North York hospital needs to process 50 diverse datasets,
> > which are on a dozen different internal systems, (ii) oil & gas companies
> > stated they need to process large amounts of data they produce every day,
> > e.g., a single oil company can produce more than 1.5TB of diverse
> > (structured and unstructured) data per day, (iii) Fortune magazine stated
> > that airlines need to analyze large datasets, which are produced by
> > different departments, are of different data formats, and reside on
> > multiple data sources, to produce global reports for decision makers, and
> > (iv) Hewlett Packard has claimed that, according to its customer portfolio,
> > business intelligence typically requires a single analytics pipeline using
> > different processing platforms at different parts of the pipeline. These
> > are just a few examples of emerging applications that require a diversity
> > of data processing platforms.
> >
> > Today, developers have to deal with this myriad of data processing
> > platforms. That is, they have to choose the right data processing platform
> > for their applications (or data analytic tasks) and familiarize themselves
> > with the intricacies of the different platforms to achieve high efficiency and
> > scalability. Several systems have also appeared with the goal of helping
> > users to easily glue several platforms together, such as Apache Drill,
> > PrestoDB, and Luigi. Nevertheless, all these systems still require
> > considerable expertise from users to decide which data processing platforms to use
> > for the data analytic task at hand. In consequence, great engineering
> > effort is required to unify the data from various sources, to combine the
> > processing capabilities of different platforms, and to maintain those
> > applications, so as to unleash the full potential of the data. In the worst
> > case, such applications are not built in the first place, as the endeavor
> > seems too daunting.
> >
> > === Rationale ===
> >
> > It is evident that there is an urgent need to release developers from the
> > burden of knowing all the intricacies of choosing and gluing together data
> > processing platforms for supporting their applications (data analytic
> > tasks). Developers should be able to focus only on the logic of their applications.
> > Surprisingly, there is no open source system trying to satisfy this urgent
> > need. Wayang aims at filling this gap. It addresses this need by
> > providing both a common interface over data processing platforms and an
> > optimizer to execute data analytic tasks on the right data processing
> > platform(s) seamlessly. As Apache is the place where most of the important
> > big data systems are, we consider Apache the right place for Wayang.
> >
> > === Current Status ===
> >
> > The current version of Wayang (v0.5.0) was initially co-developed by
> > staff, students, and interns at the Qatar Computing Research Institute
> > (QCRI) and the Hasso-Plattner Institute (HPI). The project was initiated at
> > and sponsored by QCRI in 2015 with the goal of freeing data scientists and
> > developers from the intricacies of data processing platforms to support
> > their analytic tasks. The first open source release of Wayang was made only
> > a year and a half later, on June 13th, 2016, under the Apache Software
> > License 2.0. Since then, we have made several releases; the latest release was
> > made on January 23rd, 2019.
> >
> > ** Meritocracy **
> >
> > All current Wayang developers are familiar with the development process
> > at Apache and are currently trying to follow this meritocracy process as
> > much as possible. For example, Wayang already follows a committer principle
> > where any pull request is analyzed by at least one Wayang core developer.
> > This was one of the reasons for choosing Apache for Wayang as we all want
> > to encourage and keep this style of development for Wayang.
> >
> > ** Community **
> >
> > Wayang started as a pure research project, but it quickly started
> > developing into a community. People from HPI quickly joined our efforts
> > almost from the very beginning to make this project a reality. Recently,
> > the Berlin Institute of Technology (TU Berlin) and the Pontifical Catholic
> > University of Valparaiso (PUCV) in Chile have also joined our efforts for
> > developing Wayang. A company, called Scalytics, has been created around
> > Wayang. Currently, we are intensively seeking to further develop both
> > developer and user communities. To keep broadening the community, we plan
> > to also exploit our ongoing academic collaborations with multiple
> > universities in Berlin and companies that we collaborate with. For
> > instance, Wayang is already being utilized for accessing multiple data
> > sources in the context of a large data analytics project led by TU Berlin
> > and Huawei. We also believe that Wayang's extensible architecture (i.e.,
> > adding new operators and platforms) will further encourage community
> > participation. During incubation we plan to have Wayang adopted by at least
> > one company and will explicitly seek more industrial participation.
> >
> > ** Core Developers **
> >
> > The initial developers of the project are diverse: they come from four
> > different institutions (TU Berlin, Scalytics, PUCV, and HBKU). We will work
> > aggressively to grow the community during the incubation by recruiting more
> > developers from other institutions.
> >
> > ** Alignment **
> >
> > We believe Apache is the most natural home for taking Wayang to the next
> > level. Apache is currently hosting the most important big data systems.
> > Hadoop, Spark, Flink, HBase, Hive, Tez, Reef, Storm, Drill, and Ignite are
> > just some examples of these technologies. Wayang fills a significant gap
> > that exists in the big data open source world: it provides a common
> > abstraction for all these platforms and decides on which platforms to run
> > a single data analytic task. Wayang is now being developed following the
> > Apache-style development model. Also, it is well-aligned with the Apache
> > principle of building a community to impact the big data community.
> >
> > === Known Risks ===
> >
> > ** Orphaned Products **
> >
> > Currently, Wayang is the core technology behind Scalytics Inc. As a
> > result, a team of two engineers is working full time on this
> > project. Recently, three more developers have joined our efforts in
> > building Wayang. Thus, the risk of Wayang becoming orphaned is
> > very low. In addition, people outside Scalytics (from TU Berlin and HBKU) have
> > also joined the project, which makes the risk of abandoning the project
> > even lower. The PUCV in Chile is also beginning to contribute to the code
> > base and to develop a declarative query language on top of Wayang. The
> > project is constantly being monitored by email and frequent Skype meetings
> > as well as by weekly meetings with Scalytics people. Additionally, at the
> > end of each year, we meet to discuss the status of the project as well as
> > to plan the most important aspects we should work on during the following
> > year.
> >
> > ** Inexperience with Open Source **
> >
> > Wayang quickly started being developed in open source under the Apache
> > Software License 2.0. The source code is available on GitHub. Also, a few of
> > the initial committers have contributed to other open source projects:
> > Hadoop and Flume.
> >
> > ** Homogeneous Developers **
> >
> > The initial committers are already geographically distributed among Chile,
> > Germany, and Qatar. During incubation, one of our main goals is to increase
> > the heterogeneity of the current community and we will work hard to achieve
> > it.
> >
> > ** Reliance on salaried developers **
> >
> > Wayang is already being developed through a mix of full-time and volunteer
> > work. Only 2 of the initial committers are working full time on this
> > project (Scalytics). So, we are confident that the project will not
> > decrease its development pace. Furthermore, we are committed to recruiting
> > additional committers to significantly increase the development pace of the
> > project.
> >
> > ** Relationships with other Apache products **
> >
> > Wayang is somewhat related to Apache Spark, as its development interface is
> > inspired by Spark. In contrast to Spark, Wayang is not a data processing
> > platform, but a mediator between user applications and data processing
> > platforms. In this sense, Wayang is similar to the Apache Drill and
> > Apache Beam projects. However, Wayang significantly differs from Apache Drill in
> > two main aspects. First, Apache Drill provides only a common interface to
> > query multiple data stores and hence users have to specify in their query
> > the data to fetch. Then, Apache Drill translates the query to the
> > processing platforms where the data is stored, e.g., into the MongoDB query
> > representation. In contrast, in Wayang, users only specify the data path
> > and Wayang decides which are the best (performance-wise) data processing
> > platforms to use to process such data. Second, the query interface in
> > Apache Drill is SQL. Wayang uses an interface based on operators forming
> > DAGs. On this latter point, we are currently developing a Pig Latin-like
> > query language for Wayang. In addition, in contrast to Apache Beam, Wayang
> > not only allows users to use multiple data processing platforms at the same
> > time, but also provides an optimizer to choose the most efficient
> > platform for the task at hand. In Apache Beam, users have to specify an
> > appropriate runner (platform).
> >
> > Given these similarities with the two Apache projects mentioned above, we
> > are looking forward to collaborating with those communities. We are also
> > open to collaborating with other Apache communities and would love to do so.
> >
> > ** An excessive fascination with the Apache Brand **
> >
> > Wayang solves a real problem that users and developers currently have to
> > deal with at a high cost: monetary cost, high design and development
> > effort, and considerable time. Therefore, we believe that Wayang can be
> > successful in building a large community around it. We are convinced that
> > the Apache brand and community process will significantly help us in
> > building such a community and establishing the project in the long term. We
> > simply believe that the ASF is the right home for Wayang to achieve this.
> >
> > === Documentation ===
> >
> > Further details, documentation, and publications related to Wayang can be
> > found at https://docs.rheem.io/rheem/
> >
> > === Initial Source ===
> >
> > The current source code of Wayang resides on GitHub:
> > https://github.com/rheem-ecosystem/rheem
> >
> > === External Dependencies ===
> >
> > Wayang depends on the following Apache projects:
> >
> > * Maven
> > * HDFS
> > * Hadoop
> > * Spark
> >
> > Wayang depends on the following other open source projects organized by
> > license:
> >
> > org.json.json: Json (http://json.org/license.html)
> > SnakeYAML: Apache 2.0
> > Java Unified Expression Language API (Juel): Apache 2.0
> > ProfileDB Instrumentation: Apache 2.0
> > Gson: Apache 2.0
> > Hadoop: Apache 2.0
> > Scala: Apache 2.0
> > Antlr 4: BSD
> > Jackson: Apache 2.0
> > Junit 5: EPL 2.0
> > Mockito: MIT
> > Assertj: Apache 2.0
> > logback-classic: EPL 1.0 / LGPL 2.1
> > slf4j: MIT
> > GNU Trove: LGPL 2.1
> > graphchi: Apache 2.0
> > SQLite JDBC: Apache 2.0
> > PostgreSQL: BSD 2-clause
> > jcommander: Apache 2.0
> > Koloboke Collections API: Apache 2.0
> > Snappy Java: Apache 2.0
> > Apache Spark: Apache 2.0
> > HyperSQL Database: BSD Modified (http://hsqldb.org/web/hsqlLicense.html)
> > Apache Giraph: Apache 2.0
> > Apache Flink: Apache 2.0
> > Apache Commons IO: Apache 2.0
> > Apache Commons Lang: Apache 2.0
> > Apache Maven: Apache 2.0
> >
> > === Cryptography ===
> >
> > (not applicable)
> >
> > === Required Resources ===
> >
> > ** Mailing Lists **
> >
> > * mailto:priv...@wayang.incubator.apache.org
> > * mailto:d...@wayang.incubator.apache.org
> > * mailto:comm...@wayang.incubator.apache.org
> >
> > ** Git repositories **
> >
> > git://git.apache.org/repos/asf/incubator/wayang
> >
> > ** Issue tracking **
> >
> > https://issues.apache.org/jira/browse/RHEEM
> >
> > === Initial Committers ===
> >
> > The following list gives the planned initial committers (in alphabetical
> > order):
> >
> > * Bertty Contreras-Rojas <bertty@scalytics.io>
> > * Rodrigo Pardo-Meza <rodrigo@scalytics.io>
> > * Alexander Alten-Lorenz <alo@scalytics.io>
> > * Zoi Kaoudi <zoi.kaoudi@tu-berlin.de>
> > * Haralampos Gavriilidis <gavriilidis@tu-berlin.de>
> > * Jorge-Arnulfo Quiane-Ruiz <jorge.quiane@tu-berlin.de>
> > * Anis Troudi <atroudi@hbku.edu.qa>
> > * Wenceslao Palma-Muñoz <wenceslao.palma@pucv.cl>
> >
> > ** Affiliations **
> >
> > * Scalytics Inc.
> > ** Bertty Contreras-Rojas
> > ** Rodrigo Pardo-Meza
> > ** Alexander Alten-Lorenz
> > * Berlin Institute of Technology (TU Berlin)
> > ** Zoi Kaoudi
> > ** Haralampos Gavriilidis
> > ** Jorge-Arnulfo Quiane-Ruiz
> > * Hamad Bin Khalifa University (HBKU)
> > ** Anis Troudi
> > * Pontifical Catholic University of Valparaiso, Chile (PUCV)
> > ** Wenceslao Palma-Muñoz
> >
> > === Sponsors ===
> >
> > ** Champion **
> >
> > * Christofer Dutz (christofer.dutz at c-ware dot de)
> >
> > ** Mentors **
> >
> > . (cdutz) Christofer Dutz
> > . (larsgeorge) Lars George
> > . (berndf) Fondermann
> > . (jbonofre) Jean-Baptiste Onofré
> >
> > ** Sponsoring Entity **
> >
> > The Apache Incubator
> >
> >
>
> --
> Dan Widdis
>
