+1

It’s actually great that projects outside Spark’s repo can be more successful than the projects inside. A testament to both Spark itself and Spark Connect!
On Tue, Aug 13, 2024 at 10:00 AM Martin Grund <mar...@databricks.com.invalid> wrote:

> +1
>
> On Tue, Aug 13, 2024 at 7:26 AM Ruifeng Zheng <ruife...@apache.org> wrote:
>
>> +1
>>
>> On Tue, Aug 13, 2024 at 1:08 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Are the sparklyr folks on this list?
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> Pronouns: she/her
>>>
>>> On Mon, Aug 12, 2024 at 5:22 PM Xiao Li <gatorsm...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> Hyukjin Kwon <gurwls...@apache.org> wrote on Mon, Aug 12, 2024 at 16:18:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Aug 13, 2024 at 7:04 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> And just for the record, the stats that I screenshotted <https://lists.apache.org/api/email.lua?attachment=true&id=jd1hyq6c9v1qg0ym5qlct8lgcxk9yd6z&file=7a28ae0d6eb4c25e047ff90601a941f7acfc3214f837604b545b4f926b8eb628> in that thread I linked to showed the following page views for each sub-section under `docs/latest/api/`:
>>>>>>
>>>>>> - python: 758K
>>>>>> - java: 66K
>>>>>> - sql: 39K
>>>>>> - scala: 35K
>>>>>> - r: <1K
>>>>>>
>>>>>> I don’t recall over what time period those stats were collected, and there are certainly factors in how the stats are gathered, and in how the various language API docs are accessed, that affect those numbers. So it’s by no means a solid, objective measure. But I thought it was an interesting signal nonetheless.
>>>>>>
>>>>>> On Aug 12, 2024, at 5:50 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>>
>>>>>> Not an R user myself, but +1.
>>>>>> I first wondered about the future of SparkR after noticing <https://lists.apache.org/thread/jd1hyq6c9v1qg0ym5qlct8lgcxk9yd6z> how low the visit stats were for the R API docs as compared to Python and Scala. (I can’t seem to find those visit stats <https://analytics.apache.org/index.php?module=CoreHome&action=index&date=today&period=month&idSite=40#?period=month&date=2024-07-02&idSite=40&category=General_Actions&subcategory=General_Pages> for the API docs anymore.)
>>>>>>
>>>>>> On Aug 12, 2024, at 11:47 AM, Shivaram Venkataraman <shivaram.venkatara...@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> About ten years ago, I created the original SparkR package as part of my research at UC Berkeley [SPARK-5654 <https://issues.apache.org/jira/browse/SPARK-5654>]. After my PhD I started as a professor at UW-Madison, and my contributions to SparkR have been in the background given my limited availability. I continue to be involved in the community and teach a popular course at UW-Madison that uses Apache Spark for programming assignments.
>>>>>>
>>>>>> As the original contributor and author of a research paper on SparkR, I also continue to get private emails from users. A common question I get is whether one should use SparkR in Apache Spark or the sparklyr package (built on top of Apache Spark). You can also see this in StackOverflow questions and other blog posts online: https://www.google.com/search?q=sparkr+vs+sparklyr .
>>>>>> While I have encouraged users to choose the SparkR package, as it is maintained by the Apache project, the more I looked into sparklyr, the more I was convinced that it is a better choice for R users who want to leverage the power of Spark:
>>>>>>
>>>>>> (1) sparklyr is developed by a community of developers who understand the R programming language deeply, and as a result it is more idiomatic. In hindsight, sparklyr’s more idiomatic approach would have been a better choice than the Scala-like API we have in SparkR.
>>>>>>
>>>>>> (2) Contributions to SparkR have slowly decreased. Over the last two years, there have been 65 commits on the Spark R codebase (compared to ~2200 on the Spark Python codebase). In contrast, sparklyr has had over 300 commits in the same period.
>>>>>>
>>>>>> (3) Previously, using and deploying sparklyr had been cumbersome, as it needed careful alignment of versions between Apache Spark and sparklyr. However, the sparklyr community has implemented a new Spark Connect-based architecture which eliminates this issue.
>>>>>>
>>>>>> (4) The sparklyr community has maintained their package on CRAN. It takes some effort to do this, as the CRAN release process requires passing a number of tests. While SparkR was on CRAN initially, we could not maintain that given our release process and cadence. This makes sparklyr much more accessible to the R community.
>>>>>>
>>>>>> So it is with a bittersweet feeling that I’m writing this email to propose that we deprecate SparkR and recommend sparklyr as the R language binding for Spark. This will reduce the complexity of our own codebase and, more importantly, reduce confusion for users.
>>>>>> As the sparklyr package is distributed under the same permissive license as Apache Spark, there should be no downside for existing SparkR users in adopting it.
>>>>>>
>>>>>> My proposal is to mark SparkR as deprecated in the upcoming Spark 4 release, and remove it from Apache Spark in the following major release, Spark 5.
>>>>>>
>>>>>> I’m looking forward to hearing your thoughts and feedback on this proposal, and I’m happy to create the SPIP ticket for a vote on it, using this email thread as the justification.
>>>>>>
>>>>>> Thanks,
>>>>>> Shivaram