I think what people are saying rather consistently is that we want the RDD APIs to be available by default, without requiring additional config changes. RDDs are a core API that is frequently used.
Personally, I think this proposal would have had more success if it had been introduced for new languages only, rather than breaking existing RDD users. But now that it has -1s, including from PMC members, I think this proposal should be dropped; if it's something we as a project want to consider in the future, it will need some more work to build consensus.

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Sat, Dec 14, 2024 at 1:03 PM Martin Grund <mar...@databricks.com> wrote:

> Dongjoon, nobody is saying that RDDs should not be part of the public API. It is very important to understand the difference here.
>
> I've articulated this before and will try again. It is possible that existing workloads require RDDs, and these are very much supported by setting the Spark conf for the API mode. This is similar to any other Spark conf that a deployment sets to configure an application.
>
> The guidance with Spark Connect as the default is to provide a path where future developers and users of Spark leverage the declarative interface by default.
>
> I would really like to look at this proposal as a forward-looking decision that aims to ease the life of Spark users with better classpath isolation, better upgrade behavior, and better application integration. The goal is to optimize for the new users and workloads that will come over time, while allowing all existing workloads to run by setting exactly one Spark conf.
>
> On Sat, Dec 14, 2024 at 04:22 Ángel <angel.alvarez.pas...@gmail.com> wrote:
>
>> -1
>>
>> On Sat, 14 Dec 2024 at 1:36, Dongjoon Hyun (<dongjoon.h...@gmail.com>) wrote:
>>
>>> For the RDD part, I also disagree with Martin.
>>> I believe RDDs should be supported permanently as a public API.
>>> Otherwise, it would be a surprise to me and my colleagues, at the very least.
>>>
>>> > I would assume that we all agree that
>>> > 99% of the _new_ users in Spark should not try to write code in RDDs.
>>>
>>> Given this long discussion, I have also decided to switch my vote from +1 to -1, because it seems too early to make this decision given the pending `Spark Connect` work and the active discussion. Previously, I was biased too much toward only the SQL part.
>>>
>>> As a side note, I hope the Apache Spark 4.0.0 release is not going to be blocked by the pending `Spark Connect` work and decision.
>>>
>>> Dongjoon.
>>>
>>> On Tue, Dec 3, 2024 at 7:51 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> Pronouns: she/her
>>>>
>>>> On Fri, Nov 29, 2024 at 12:24 AM Martin Grund <mar...@databricks.com> wrote:
>>>>
>>>>> At the risk of repeating what Herman said word for word :) I would like to call out the following:
>>>>>
>>>>> 1. The goal of setting the default is to guide users to the Spark SQL APIs that have proven themselves over time.
>>>>> We shouldn't underestimate the power of the default. I would assume that we all agree that 99% of the _new_ users in Spark should not try to write code in RDDs.
>>>>
>>>> I would disagree here. Maybe more like 75%.
>>>>
>>>>> 2. Any user, organization, or vendor can leverage *all* of their existing code by simply changing *one* configuration during startup: switching spark.api.mode to classic (e.g., similar to ANSI mode). This means all existing RDD and library code just works.
>>>>>
>>>>> 3. Creating a fractured user experience by using some logic to identify which API mode is used is not ideal. Many of the use cases I've seen that require additional jars (e.g., data sources, drivers) just work, because Spark already has the right abstractions. JARs used in the client-side part of the code just work, as Herman said.
>>>>
>>>> Introducing a config flag that defaults to a limited API already introduces a fractured user experience, where an application may fail partway through running.
>>>>
>>>>> 4. Similarly, based on the experience of running Spark Connect in production, the coexistence of workloads running in classic mode and connect mode works fine.
>>>>
>>>> I still don't like "classic" mode (maybe "full" and "restricted").
>>>>
>>>>> On Fri, Nov 29, 2024 at 3:18 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>
>>>>>> I would switch to +0 if connect were the default only for apps without any user-provided jars and for non-JVM apps.
>>>>>>
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>> Pronouns: she/her
>>>>>>
>>>>>> On Thu, Nov 28, 2024 at 6:11 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>>
>>>>>>> Given there is no plan to support RDDs, I'll update to -0.9.
>>>>>>>
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>> Pronouns: she/her
>>>>>>>
>>>>>>> On Thu, Nov 28, 2024 at 6:00 PM Herman van Hovell <her...@databricks.com> wrote:
>>>>>>>
>>>>>>>> Hi Holden and Mridul,
>>>>>>>>
>>>>>>>> Just to be clear: what API parity are you expecting here? We have parity for everything that is exposed in org.apache.spark.sql. Connect does not support RDDs, SparkContext, etc., and there are currently no plans to support them. We are considering adding a compatibility layer, but that will be limited in scope. From running Connect in production for the last year, we see that most users can migrate their workloads without any problems.
>>>>>>>>
>>>>>>>> I do want to call out that this proposal is mostly aimed at how new users will interact with Spark.
>>>>>>>> Existing users, when they migrate their application to Spark 4, only have to set a conf if it turns out their application does not work. This should be a minor inconvenience compared to the headaches that a new Scala version or other library upgrades can cause.
>>>>>>>>
>>>>>>>> Since this is a breaking change, I do think it should be done in a major version.
>>>>>>>>
>>>>>>>> At the risk of repeating the SPIP: using Connect as the default brings a lot to the table (e.g., simplicity, easier upgrades, extensibility), and I'd urge you to also factor this into your decision making.
>>>>>>>>
>>>>>>>> Happy Thanksgiving!
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Herman
>>>>>>>>
>>>>>>>> On Thu, Nov 28, 2024 at 8:43 PM Mridul Muralidharan <mri...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I agree with Holden; I am leaning -1 on the proposal as well.
>>>>>>>>> Unlike the removal of deprecated features, which we align on a major version boundary, changing the default is something we can do in a minor version as well, once there is API parity.
>>>>>>>>>
>>>>>>>>> Irrespective of which major/minor version we make the switch in, there could be user impact; minimizing this impact would be greatly appreciated by our users.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Mridul
>>>>>>>>>
>>>>>>>>> On Wed, Nov 27, 2024 at 8:31 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> -0.5: I don't think this is a good idea for JVM apps until we have API parity. (Binding, but to be clear, not a veto.)
>>>>>>>>>>
>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>>>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>>>>> Pronouns: she/her
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 27, 2024 at 6:27 PM Xinrong Meng <xinr...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>> Thank you, Herman!
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 28, 2024 at 3:37 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 27, 2024 at 09:16 Denny Lee <denny.g....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 27, 2024 at 3:07 AM Martin Grund <mar...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> As part of the discussion on this topic, I would love to highlight the work the community is currently doing to support SparkML, which is traditionally very RDD-heavy, natively in Spark Connect. Bobby's awesome work shows that, over time, we can extend the features of Spark Connect and support workloads that we previously thought could not be supported easily.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://github.com/apache/spark/pull/48791
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Martin
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 27, 2024 at 11:42 AM Yang,Jie(INF) <yangji...@baidu.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -------- Original Message --------
>>>>>>>>>>>>>>> From: Hyukjin Kwon <gurwls...@apache.org>
>>>>>>>>>>>>>>> Date: 2024-11-27 08:04:06
>>>>>>>>>>>>>>> Subject: [External] Re: Spark Connect the default API in Spark 4.0
>>>>>>>>>>>>>>> To: Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>>>>>>>>>>>>>> Cc: Herman van Hovell <her...@databricks.com.invalid>; Spark dev list <dev@spark.apache.org>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, 25 Nov 2024 at 23:33, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, 25 Nov 2024 at 14:48, Herman van Hovell <her...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would like to start a discussion on "Spark Connect the default API in Spark 4.0".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The rationale for this change is that Spark Connect brings a lot of improvements with respect to simplicity, stability, isolation, upgradability, and extensibility (all detailed in the SPIP). In a nutshell: we want to introduce a flag, spark.api.mode, that allows a user to choose between classic and connect mode, the default being connect. A user can easily fall back to classic by setting spark.api.mode to classic.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> SPIP: https://docs.google.com/document/d/1C0kuQEliG78HujVwdnSk0wjNwHEDdwo2o8aVq7kbhTo/edit?tab=t.0#heading=h.r2c3xrbiklu3
>>>>>>>>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-50411
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am looking forward to your feedback!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Herman
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Bjørn Jørgensen
>>>>>>>>>>>>>>>> Vestre Aspehaug 4, 6010 Ålesund, Norge
>>>>>>>>>>>>>>>> +47 480 94 297
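
For concreteness, the one-conf fallback discussed throughout the thread would look roughly like the following Scala sketch. It assumes the spark.api.mode setting from the SPIP is honored by the session builder as described there; the application name and the small RDD word count are only hypothetical stand-ins for an existing classic-API workload.

    import org.apache.spark.sql.SparkSession

    object ClassicModeFallback {
      def main(args: Array[String]): Unit = {
        // Opt out of the proposed Spark Connect default and keep the full
        // JVM API surface (SparkContext, RDDs) for this application.
        // The exact effect of spark.api.mode is per the SPIP, not verified here.
        val spark = SparkSession.builder()
          .appName("existing-rdd-workload")
          .config("spark.api.mode", "classic")
          .getOrCreate()

        // Existing RDD code that connect mode does not support runs unchanged.
        val counts = spark.sparkContext
          .parallelize(Seq("spark", "connect", "spark"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.collect().foreach(println)
        spark.stop()
      }
    }

The same setting could presumably also be supplied without touching code, e.g. spark-submit --conf spark.api.mode=classic, which is what would make this a one-line change for existing deployments.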