Greetings!

We are proposing to enter sparklyr (https://spark.rstudio.com/), an open
source R package for interfacing with Apache Spark, into incubation. Please
see the proposal below.

======

= Abstract =

sparklyr is an open source R package providing an interface to Apache
Spark, a system for large-scale data analysis on clusters. It provides a
dplyr interface for manipulating Spark DataFrames, supports the Spark ML
and Structured Streaming components, and offers a developer API to create
extensions.

= Proposal =

The sparklyr project, along with the ecosystem of extensions it supports,
aims to democratize the capabilities of Apache Spark for R users, who
represent a significant portion of data scientists today. The API is
designed to reduce friction for users transitioning from local, “small
data” workflows to computing on clusters, while preserving the flexibility
of Apache Spark as much as possible. Some features include:

- It is compatible with the tidyverse ecosystem of packages, which is a
popular collection of libraries for data science in R. Specifically, one
can use `dplyr` verbs to manipulate Spark DataFrames. However, one can also
use sparklyr without using tidyverse packages.
- It features an extensions API that allows users to easily wrap existing
Spark packages written in Scala. This has enabled the development of
sparkxgb (interface for xgboost4j), graphframes (interface for
GraphFrames), mleap (interface for MLeap), and sparktf (interface for Spark
TensorFlow connector), to name a few.

= Rationale =

By becoming an Apache project, sparklyr can better align with the Apache
Spark project and encourage stronger collaboration among users and
contributors in the R and Apache communities. Culturally, sparklyr is also
a good fit for the ASF: the development of the project has adhered to the
Apache Way since inception, and the current contributors are committed to
upholding those values.

= Initial Goals =

The initial goals are to move the existing codebase to Apache and to
migrate the documentation from the RStudio domain to Apache.

= Current Status =

== Meritocracy ==

The sparklyr project has operated on meritocratic principles since
inception. We have accepted major patches from developers outside RStudio,
and have operated with the implicit expectation that contributors to major
features maintain those features.

== Community ==

The sparklyr project currently has 699 stars on GitHub, 52 direct
contributors, ~1,400 issues (approximately 500 of which are open), and
approximately 194,000 downloads from CRAN each month. The documentation
website spark.rstudio.com receives ~15k visitors per month. There are also
more than 15 open source extensions that implement features such as
genomic analysis and interoperability with databases.

= Known Risks =

== Reliance on Salaried Developers ==

sparklyr is currently maintained by salaried developers at RStudio and
receives some ongoing contributions from the community, although all
committers are employed by RStudio. We hope that by becoming an Apache
project, the project will garner additional developer interest and expand
the diversity of committers.

= Documentation =

Documentation of the project can be found at https://spark.rstudio.com/ and
https://cran.r-project.org/web/packages/sparklyr/sparklyr.pdf. There is
also a free online book, available at https://therinspark.com/, that can be
used as a reference.

= Initial Source =

The sparklyr codebase is currently hosted on GitHub:
https://github.com/rstudio/sparklyr. sparklyr has been Apache 2.0 licensed
since inception. RStudio currently maintains CLAs from all significant
contributors. RStudio does not own the copyright to sparklyr, and sparklyr
is not a trademark.

= External Dependencies =

Note that `sparklyr` imports some R packages whose licenses are not
Apache-compatible; however, these packages are not distributed with the
project. For example, R itself is GPLv2 licensed.

= Required Resources =

- Mailing lists: {users, dev, commits}@sparklyr.incubator.apache.org
- GitHub repo
- If possible, we would like to continue using GitHub for issue tracking,
as it is much more familiar to the R community than JIRA.

= Project Name =

There is sufficient goodwill built around the package that we would like to
keep the name. sparklyr is pronounced spark-lee-R (i.e., it does not rhyme
with the data manipulation package dplyr) and is never capitalized.
Incorrect spellings include SparklyR and sparklyR.

= Initial Committers =

Javier Luraschi <jav...@rstudio.com> (RStudio)
Kevin Kuo <keviny...@gmail.com> (RStudio)
Hossein Falaki <hoss...@databricks.com> (Databricks)

= Sponsors =

== Champion ==

Xiangrui Meng

== Nominated Mentors ==

Xiangrui Meng
Felix Cheung
Sean R. Owen
