-1
On Sat, Dec 14, 2024 at 1:36 AM, Dongjoon Hyun (<dongjoon.h...@gmail.com>) wrote:

> For the RDD part, I also disagree with Martin.
> I believe RDD should be supported permanently as the public API.
> Otherwise, it would be a surprise to me and my colleagues at least.
>
> > I would assume that we all agree that
> > 99% of the _new_ users in Spark should not try to write code in RDDs.
>
> Given this long discussion context,
> I have also decided to switch my vote from +1 to -1,
> because it seems too early to make this decision
> given the pending `Spark Connect` work and the active discussion.
> Previously, I was too biased toward the SQL part.
>
> As a side note, I hope the Apache Spark 4.0.0 release is not going
> to be blocked by the pending `Spark Connect` work and decision.
>
> Dongjoon.
>
> On Tue, Dec 3, 2024 at 7:51 PM Holden Karau <holden.ka...@gmail.com> wrote:
>
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>> On Fri, Nov 29, 2024 at 12:24 AM Martin Grund <mar...@databricks.com> wrote:
>>
>>> At the risk of repeating what Herman said word for word :) I would like to call out the following:
>>>
>>> 1. The goal of setting the default is to guide users toward the Spark SQL APIs that have proven themselves over time. We shouldn't underestimate the power of the default. I would assume that we all agree that 99% of the _new_ users in Spark should not try to write code in RDDs.
>>
>> I would disagree here. Maybe more like 75%.
>>
>>> 2. Any user, organization, or vendor can leverage *all* of their existing code by simply changing *one* configuration during startup: switching spark.api.mode to classic (e.g., similar to ANSI mode). This means all existing RDD and library code just works.
>>>
>>> 3. Creating a fractured user experience by using some logic to identify which API mode is used is not ideal. Many of the use cases I've seen that require additional jars (e.g., data sources, drivers) just work, because Spark already has the right abstractions. JARs used in the client-side part of the code also just work, as Herman said.
>>
>> Introducing a config flag that defaults to a limited API already introduces a fractured user experience, where an application may fail partway through running.
>>
>>> 4. Similarly, based on the experience of running Spark Connect in production, the co-existence of workloads running in classic mode and connect mode is working fine.
>>
>> I still don't like "classic" as a mode name (maybe "full" and "restricted").
>>
>>> On Fri, Nov 29, 2024 at 3:18 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>>
>>>> I would switch to +0 if connect were the default only for apps without any user-provided jars / non-JVM apps.
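The RDD-versus-SQL contrast debated above can be illustrated without Spark at all. Below is a plain-Python sketch (not PySpark; list comprehensions and a dict stand in for flatMap/map/reduceByKey) of why the imperative RDD style asks more of new users than a declarative aggregation, using word count as the example:

```python
from collections import Counter

lines = ["spark connect", "spark classic"]

# RDD style: the user spells out every physical step of the computation.
words = [w for line in lines for w in line.split()]  # ~ flatMap(_.split(" "))
pairs = [(w, 1) for w in words]                      # ~ map(w => (w, 1))
counts = {}
for w, n in pairs:                                   # ~ reduceByKey(_ + _)
    counts[w] = counts.get(w, 0) + n

# DataFrame/SQL style: declare the aggregation and let the engine plan the
# steps (Counter plays the role of groupBy("word").count() here).
declarative = dict(Counter(words))

assert counts == declarative  # same result, very different user burden
```

In real Spark the declarative form additionally lets the optimizer pick the execution strategy, which is part of the argument for steering new users toward the SQL APIs.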
>>>>
>>>> On Thu, Nov 28, 2024 at 6:11 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>
>>>>> Given there is no plan to support RDDs, I'll update to -0.9.
>>>>>
>>>>> On Thu, Nov 28, 2024 at 6:00 PM Herman van Hovell <her...@databricks.com> wrote:
>>>>>
>>>>>> Hi Holden and Mridul,
>>>>>>
>>>>>> Just to be clear: what API parity are you expecting here? We have parity for everything that is exposed in org.apache.spark.sql. Connect does not support RDDs, SparkContext, etc. There are currently no plans to support them. We are considering adding a compatibility layer, but that will be limited in scope. From running Connect in production for the last year, we see that most users can migrate their workloads without any problems.
>>>>>>
>>>>>> I do want to call out that this proposal is mostly aimed at how new users will interact with Spark. Existing users, when they migrate their application to Spark 4, only have to set a conf if it turns out their application is not working. This should be a minor inconvenience compared to the headaches that a new Scala version or other library upgrades can cause.
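The fallback discussed above is a single configuration change. Assuming the flag lands as spelled in the SPIP (spark.api.mode with values connect/classic; the exact spelling could still change before release), it might look like:

```shell
# Per-application fallback to the classic JVM-backed API
# (hypothetical invocation based on the SPIP; the proposed default is "connect"):
./bin/spark-submit --conf spark.api.mode=classic my_app.py

# Or cluster-wide, by adding a line to conf/spark-defaults.conf:
#   spark.api.mode   classic
```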
>>>>>>
>>>>>> Since this is a breaking change, I do think this should be done in a major version.
>>>>>>
>>>>>> At the risk of repeating the SPIP: using Connect as the default brings a lot to the table (e.g. simplicity, easier upgrades, extensibility, etc.). I'd urge you to also factor this into your decision making.
>>>>>>
>>>>>> Happy Thanksgiving!
>>>>>>
>>>>>> Cheers,
>>>>>> Herman
>>>>>>
>>>>>> On Thu, Nov 28, 2024 at 8:43 PM Mridul Muralidharan <mri...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I agree with Holden; I am leaning -1 on the proposal as well. Unlike the removal of deprecated features, which we align on a major version boundary, changing the default is something we can do in a minor version as well - once there is API parity.
>>>>>>>
>>>>>>> Irrespective of which major/minor version we make the switch in, there could be user impact; minimizing this impact would be greatly appreciated by our users.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>> On Wed, Nov 27, 2024 at 8:31 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>>>
>>>>>>>> -0.5: I don't think this is a good idea for JVM apps until we have API parity. (Binding, but to be clear, not a veto.)
>>>>>>>>
>>>>>>>> On Wed, Nov 27, 2024 at 6:27 PM Xinrong Meng <xinr...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> Thank you Herman!
>>>>>>>>>
>>>>>>>>> On Thu, Nov 28, 2024 at 3:37 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 27, 2024 at 09:16 Denny Lee <denny.g....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 27, 2024 at 3:07 AM Martin Grund <mar...@databricks.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> As part of the discussion on this topic, I would love to highlight the work the community is currently doing to support SparkML, which is traditionally very RDD-heavy, natively in Spark Connect. Bobby's awesome work shows that, over time, we can extend the features of Spark Connect and support workloads that we previously thought could not be supported easily.
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/apache/spark/pull/48791
>>>>>>>>>>>>
>>>>>>>>>>>> Martin
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 27, 2024 at 11:42 AM Yang,Jie(INF) <yangji...@baidu.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>> -------- Original message --------
>>>>>>>>>>>>> From: Hyukjin Kwon <gurwls...@apache.org>
>>>>>>>>>>>>> Date: 2024-11-27 08:04:06
>>>>>>>>>>>>> Subject: [External Mail] Re: Spark Connect the default API in Spark 4.0
>>>>>>>>>>>>> To: Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>>>>>>>>>>>> Cc: Herman van Hovell <her...@databricks.com.invalid>; Spark dev list <dev@spark.apache.org>
>>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 25 Nov 2024 at 23:33, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Nov 25, 2024 at 14:48, Herman van Hovell <her...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would like to start a discussion on "Spark Connect the default API in Spark 4.0".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The rationale for this change is that Spark Connect brings a lot of improvements with respect to simplicity, stability, isolation, upgradability, and extensibility (all detailed in the SPIP). In a nutshell: we want to introduce a flag, spark.api.mode, that allows a user to choose between classic and connect mode, the default being connect. A user can easily fall back to classic by setting spark.api.mode to classic.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> SPIP: https://docs.google.com/document/d/1C0kuQEliG78HujVwdnSk0wjNwHEDdwo2o8aVq7kbhTo/edit?tab=t.0#heading=h.r2c3xrbiklu3
>>>>>>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-50411
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am looking forward to your feedback!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Herman
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Bjørn Jørgensen
>>>>>>>>>>>>>> Vestre Aspehaug 4, 6010 Ålesund, Norge
>>>>>>>>>>>>>> +47 480 94 297
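For readers following the thread: "connect mode" can already be tried against a current Spark distribution, independent of the proposed spark.api.mode default. A sketch (script name and default gRPC port 15002 as shipped in recent releases; Spark 3.4/3.5 may additionally need a --packages flag for the connect server plugin):

```shell
# Start a Spark Connect server from an unpacked Spark distribution
# (listens on gRPC port 15002 by default):
./sbin/start-connect-server.sh

# Attach a thin PySpark client over gRPC; the client process needs
# no local JVM, only the pyspark-connect client libraries:
./bin/pyspark --remote "sc://localhost:15002"
```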