Hi, it's an interesting proposal.
I guess that one of your challenges during incubation will be to grow the community (only 3 initial committers is very low) and to broaden its diversity (only two company affiliations).

Regards
JB

On 19/10/2019 17:53, Kevin Kuo wrote:
> Greetings!
>
> We are proposing to enter sparklyr (https://spark.rstudio.com/), an open
> source R package for interfacing with Apache Spark, into incubation. Please
> see the proposal below.
>
> ======
>
> = Abstract =
>
> sparklyr is an open source R package providing an interface to Apache
> Spark, a system for large-scale data analysis on clusters. It provides a
> dplyr interface for manipulating Spark DataFrames, supports the Spark ML
> and Structured Streaming components, and offers a developer API to create
> extensions.
>
> = Proposal =
>
> The sparklyr project, along with the ecosystem of extensions it supports,
> aims to democratize the capabilities of Apache Spark for R users, who
> represent a significant portion of data scientists today. The API is
> designed to reduce friction for users transitioning from local, “small
> data” workflows to computing on clusters, while preserving the flexibility
> of Apache Spark as much as possible. Some features include:
>
> - It is compatible with the tidyverse ecosystem of packages, which is a
> popular collection of libraries for data science in R. Specifically, one
> can use `dplyr` verbs to manipulate Spark DataFrames. However, one can also
> use sparklyr without using tidyverse packages.
> - It features an extensions API that allows users to easily wrap existing
> Spark packages written in Scala. This has enabled the development of
> sparkxgb (interface for xgboost4j), graphframes (interface for
> GraphFrames), mleap (interface for MLeap), and sparktf (interface for Spark
> TensorFlow connector), to name a few.
>
> = Rationale =
>
> By becoming an Apache project, sparklyr can better align with the Apache
> Spark project, and encourage stronger collaboration among users and
> contributors in the R and Apache communities. Culturally, sparklyr is also
> a good fit for ASF: the development of the project has adhered to the
> Apache way since inception, and the current contributors are committed to
> upholding those values.
>
> = Initial Goals =
>
> The initial goals will be to move the existing codebase to Apache and the
> documentation from the RStudio domain to Apache.
>
> = Current Status =
>
> == Meritocracy ==
>
> The sparklyr project has operated on meritocratic principles since
> inception. We have accepted major patches from developers outside RStudio,
> and have operated with the implicit expectation that contributors to major
> features maintain those features.
>
> == Community ==
>
> The sparklyr project currently has 699 stars on GitHub, 52 direct
> contributors, ~1,400 issues (approximately 500 of those are open), and
> approximately 194,000 downloads from CRAN each month. The documentation
> website spark.rstudio.com achieves ~15k visitors per month. There are also
> more than 15 open source extensions written that implement features such as
> genomic analysis and interoperability with databases.
>
> = Known Risks =
>
> == Reliance on Salaried Developers ==
>
> sparklyr is currently maintained by salaried developers at RStudio and
> receives some ongoing contributions from the community, although all
> committers are employed by RStudio. We hope that by becoming an Apache
> project, the project will garner additional developer interest and expand
> the diversity of committers.
>
> = Documentation =
>
> Documentation of the project can be found at https://spark.rstudio.com/ and
> https://cran.r-project.org/web/packages/sparklyr/sparklyr.pdf. There is
> also a free online book, available at https://therinspark.com/, that can be
> used as a reference.
>
> = Initial Source =
>
> The sparklyr codebase is currently hosted on GitHub:
> https://github.com/rstudio/sparklyr. sparklyr has been Apache 2.0 licensed
> since inception. RStudio currently maintains CLAs from all significant
> contributors. RStudio does not own the copyright of sparklyr and it is not
> a trademark.
>
> = External Dependencies =
>
> We remark that `sparklyr` imports some R packages that are not
> Apache-compatible licensed; however, these packages are not distributed
> with the project. Note, for example, R itself is GPLv2 licensed.
>
> = Required Resources =
>
> - Mailing lists: {users, dev, commits}@sparklyr.incubator.apache.org
> - GitHub repo
> - If possible, we would like to continue using GitHub for issue tracking,
> as it is much more familiar to the R community than JIRA.
>
> = Project Name =
>
> There is sufficient goodwill built around the package so we would like to
> keep the name. sparklyr is pronounced spark-lee-R, i.e. does not rhyme with
> the data manipulation package dplyr, and is never capitalized. Incorrect
> spellings include SparklyR and sparklyR.
>
> = Initial Committers =
>
> Javier Luraschi <jav...@rstudio.com> (RStudio)
> Kevin Kuo <keviny...@gmail.com> (RStudio)
> Hossein Falaki <hoss...@databricks.com> (Databricks)
>
> = Sponsors =
>
> == Champion ==
>
> Xiangrui Meng
>
> == Nominated Mentors ==
>
> Xiangrui Meng
> Felix Cheung
> Sean R. Owen
>

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org