Hi

About ten years ago, I created the original SparkR package as part of my
research at UC Berkeley [SPARK-5654
<https://issues.apache.org/jira/browse/SPARK-5654>]. After my PhD, I started
as a professor at UW-Madison, and my contributions to SparkR have taken a
back seat given my limited availability. I continue to be involved in the
community and teach a popular course at UW-Madison that uses Apache Spark
for programming assignments.

As the original contributor and author of a research paper on SparkR, I
also continue to get private emails from users. A common question I get is
whether one should use SparkR in Apache Spark or the sparklyr package
(built on top of Apache Spark). You can also see this in StackOverflow
questions and other blog posts online:
https://www.google.com/search?q=sparkr+vs+sparklyr . While I have
encouraged users to choose the SparkR package, as it is maintained by the
Apache project, the more I looked into sparklyr, the more I became convinced
that it is a better choice for R users who want to leverage the power of
Spark:

(1) sparklyr is developed by a community of developers who understand the R
programming language deeply, and as a result is more idiomatic. In
hindsight, sparklyr’s more idiomatic approach would have been a better
choice than the Scala-like API we have in SparkR.

(2) Contributions to SparkR have slowly decreased. Over the last two years,
there have been 65 commits to the Spark R codebase (compared to ~2200 to
the Spark Python codebase). In contrast, sparklyr has had over 300 commits
in the same period.

(3) Previously, using and deploying sparklyr was cumbersome, as it
required careful alignment of versions between Apache Spark and sparklyr.
However, the sparklyr community has implemented a new Spark Connect-based
architecture which eliminates this issue.

(4) The sparklyr community has kept their package on CRAN – this takes
some effort, as the CRAN release process requires passing a number of
tests. While SparkR was on CRAN initially, we could not maintain that
listing given our release process and cadence. This makes sparklyr much
more accessible to the R community.

So it is with a bittersweet feeling that I’m writing this email to propose
that we deprecate SparkR, and recommend sparklyr as the R language binding
for Spark. This will reduce the complexity of our own codebase and, more
importantly, reduce confusion for users. As the sparklyr package is
distributed using the same permissive license as Apache Spark, there should
be no downside for existing SparkR users in adopting it.

My proposal is to mark SparkR as deprecated in the upcoming Spark 4
release, and remove it from Apache Spark with the following major release,
Spark 5.

I’m looking forward to hearing your thoughts and feedback, and I’m happy
to create the SPIP ticket for a vote, using this email thread as the
justification.

Thanks
Shivaram
