Greetings! We are proposing to enter sparklyr (https://spark.rstudio.com/), an open source R package for interfacing with Apache Spark, into incubation. Please see the proposal below.
======
= Abstract =
sparklyr is an open source R package providing an interface to Apache Spark, a system for large-scale data analysis on clusters. It provides a dplyr interface for manipulating Spark DataFrames, supports the Spark ML and Structured Streaming components, and offers a developer API to create extensions.

= Proposal =
The sparklyr project, along with the ecosystem of extensions it supports, aims to democratize the capabilities of Apache Spark for R users, who represent a significant portion of data scientists today. The API is designed to reduce friction for users transitioning from local, “small data” workflows to computing on clusters, while preserving the flexibility of Apache Spark as much as possible. Some features include:
- It is compatible with the tidyverse ecosystem of packages, a popular collection of libraries for data science in R. Specifically, one can use `dplyr` verbs to manipulate Spark DataFrames. However, one can also use sparklyr without using tidyverse packages.
- It features an extensions API that allows users to easily wrap existing Spark packages written in Scala. This has enabled the development of sparkxgb (interface for xgboost4j), graphframes (interface for GraphFrames), mleap (interface for MLeap), and sparktf (interface for Spark TensorFlow connector), to name a few.

= Rationale =
By becoming an Apache project, sparklyr can better align with the Apache Spark project and encourage stronger collaboration among users and contributors in the R and Apache communities. Culturally, sparklyr is also a good fit for the ASF: the development of the project has adhered to the Apache Way since inception, and the current contributors are committed to upholding those values.

= Initial Goals =
The initial goals will be to move the existing codebase to Apache and the documentation from the RStudio domain to Apache.

= Current Status =
== Meritocracy ==
The sparklyr project has operated on meritocratic principles since inception.
We have accepted major patches from developers outside RStudio, and have operated with the implicit expectation that contributors to major features maintain those features.

== Community ==
The sparklyr project currently has 699 stars on GitHub, 52 direct contributors, ~1,400 issues (approximately 500 of which are open), and approximately 194,000 downloads from CRAN each month. The documentation website spark.rstudio.com receives ~15k visitors per month. More than 15 open source extensions have also been written, implementing features such as genomic analysis and interoperability with databases.

= Known Risks =
== Reliance on Salaried Developers ==
sparklyr is currently maintained by salaried developers at RStudio and receives some ongoing contributions from the community, although all committers are employed by RStudio. We hope that by becoming an Apache project, the project will garner additional developer interest and expand the diversity of committers.

= Documentation =
Documentation of the project can be found at https://spark.rstudio.com/ and https://cran.r-project.org/web/packages/sparklyr/sparklyr.pdf. There is also a free online book, available at https://therinspark.com/, that can be used as a reference.

= Initial Source =
The sparklyr codebase is currently hosted on GitHub: https://github.com/rstudio/sparklyr. sparklyr has been Apache 2.0 licensed since inception. RStudio currently maintains CLAs from all significant contributors. RStudio does not own the copyright of sparklyr and it is not a trademark.

= External Dependencies =
We remark that `sparklyr` imports some R packages that are not under Apache-compatible licenses; however, these packages are not distributed with the project. Note, for example, that R itself is GPLv2 licensed.

= Required Resources =
- Mailing lists: {users, dev, commits}@sparklyr.incubator.apache.org
- GitHub repo
- If possible, we would like to continue using GitHub for issue tracking, as it is much more familiar to the R community than JIRA.

= Project Name =
Sufficient goodwill has been built around the package, so we would like to keep the name. sparklyr is pronounced spark-lee-R, i.e., it does not rhyme with the data manipulation package dplyr, and it is never capitalized. Incorrect spellings include SparklyR and sparklyR.

= Initial Committers =
Javier Luraschi <jav...@rstudio.com> (RStudio)
Kevin Kuo <keviny...@gmail.com> (RStudio)
Hossein Falaki <hoss...@databricks.com> (Databricks)

= Sponsors =
== Champion ==
Xiangrui Meng

== Nominated Mentors ==
Xiangrui Meng
Felix Cheung
Sean R. Owen