+1

It’s actually great that projects outside Spark’s repo can be more successful than the projects inside. A testament to both Spark itself and Spark Connect!
On Tue, Aug 13, 2024 at 10:00 AM Martin Grund <mar...@databricks.com.invalid> wrote:

> +1
>
> On Tue, Aug 13, 2024 at 7:26 AM Ruifeng Zheng <ruife...@apache.org> wrote:
>
>> +1
>>
>> On Tue, Aug 13, 2024 at 1:08 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Are the sparklyr folks on this list?
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> Pronouns: she/her
>>>
>>> On Mon, Aug 12, 2024 at 5:22 PM Xiao Li <gatorsm...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> Hyukjin Kwon <gurwls...@apache.org> wrote on Mon, Aug 12, 2024 at 16:18:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Aug 13, 2024 at 7:04 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> And just for the record, the stats that I screenshotted <https://lists.apache.org/api/email.lua?attachment=true&id=jd1hyq6c9v1qg0ym5qlct8lgcxk9yd6z&file=7a28ae0d6eb4c25e047ff90601a941f7acfc3214f837604b545b4f926b8eb628> in that thread I linked to showed the following page views for each sub-section under `docs/latest/api/`:
>>>>>>
>>>>>> - python: 758K
>>>>>> - java: 66K
>>>>>> - sql: 39K
>>>>>> - scala: 35K
>>>>>> - r: <1K
>>>>>>
>>>>>> I don’t recall over what time period those stats were collected, and there are certainly factors in how the stats are gathered, and in how the various language API docs are accessed, that affect those numbers. So it’s by no means a solid, objective measure. But I thought it was an interesting signal nonetheless.
>>>>>>
>>>>>> On Aug 12, 2024, at 5:50 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>>
>>>>>> Not an R user myself, but +1.
>>>>>> I first wondered about the future of SparkR after noticing <https://lists.apache.org/thread/jd1hyq6c9v1qg0ym5qlct8lgcxk9yd6z> how low the visit stats were for the R API docs as compared to Python and Scala. (I can’t seem to find those visit stats <https://analytics.apache.org/index.php?module=CoreHome&action=index&date=today&period=month&idSite=40#?period=month&date=2024-07-02&idSite=40&category=General_Actions&subcategory=General_Pages> for the API docs anymore.)
>>>>>>
>>>>>> On Aug 12, 2024, at 11:47 AM, Shivaram Venkataraman <shivaram.venkatara...@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> About ten years ago, I created the original SparkR package as part of my research at UC Berkeley [SPARK-5654 <https://issues.apache.org/jira/browse/SPARK-5654>]. After my PhD I started as a professor at UW-Madison, and my contributions to SparkR have been in the background given my limited availability. I continue to be involved in the community and teach a popular course at UW-Madison that uses Apache Spark for programming assignments.
>>>>>>
>>>>>> As the original contributor and author of a research paper on SparkR, I also continue to get private emails from users. A common question I get is whether one should use SparkR in Apache Spark or the sparklyr package (built on top of Apache Spark). You can also see this in StackOverflow questions and other blog posts online: https://www.google.com/search?q=sparkr+vs+sparklyr .
>>>>>> While I have encouraged users to choose the SparkR package, as it is maintained by the Apache project, the more I looked into sparklyr, the more I was convinced that it is a better choice for R users who want to leverage the power of Spark:
>>>>>>
>>>>>> (1) sparklyr is developed by a community of developers who understand the R programming language deeply, and as a result it is more idiomatic. In hindsight, sparklyr’s more idiomatic approach would have been a better choice than the Scala-like API we have in SparkR.
>>>>>>
>>>>>> (2) Contributions to SparkR have slowly decreased. Over the last two years, there have been 65 commits on the Spark R codebase (compared to ~2200 on the Spark Python codebase). In contrast, sparklyr has had over 300 commits in the same period.
>>>>>>
>>>>>> (3) Previously, using and deploying sparklyr had been cumbersome, as it needed careful alignment of versions between Apache Spark and sparklyr. However, the sparklyr community has implemented a new Spark Connect-based architecture which eliminates this issue.
>>>>>>
>>>>>> (4) The sparklyr community has maintained their package on CRAN. It takes some effort to do this, as the CRAN release process requires passing a number of tests. While SparkR was on CRAN initially, we could not maintain that given our release process and cadence. This makes sparklyr much more accessible to the R community.
>>>>>>
>>>>>> So it is with a bittersweet feeling that I’m writing this email to propose that we deprecate SparkR and recommend sparklyr as the R language binding for Spark. This will reduce the complexity of our own codebase and, more importantly, reduce confusion for users.
>>>>>> As the sparklyr package is distributed under the same permissive license as Apache Spark, there should be no downside for existing SparkR users in adopting it.
>>>>>>
>>>>>> My proposal is to mark SparkR as deprecated in the upcoming Spark 4 release, and remove it from Apache Spark in the following major release, Spark 5.
>>>>>>
>>>>>> I’m looking forward to hearing your thoughts and feedback on this proposal, and I’m happy to create the SPIP ticket for a vote on it, using this email thread as the justification.
>>>>>>
>>>>>> Thanks,
>>>>>> Shivaram