Just taken from the Spark Connect SPIP Doc <https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj>:
"Spark Connect is not meant to be the generic interface for everything that Spark can do, but provides access to an opinionated subset of Spark features." So ... were we trying to config an option that "is not meant to be the generic interface" as the default behavior? El sáb, 14 dic 2024, 22:49, Holden Karau <holden.ka...@gmail.com> escribió: > I think what people are saying rather consistently is that we want the RDD > APIs to be available by default and not require additional config changes. > RDDs are a core API that is frequently used. > > Personally I think this proposal would have had more success if this was > introduced for new languages only and not breaking existing RDD users. > > But now that it has -1s, including PMC members, I think this proposal is > should be dropped and if it’s something we as a project want to consider in > the future it will need some more work to build consensus. > > Twitter: https://twitter.com/holdenkarau > Fight Health Insurance: https://www.fighthealthinsurance.com/ > <https://www.fighthealthinsurance.com/?q=hk_email> > Books (Learning Spark, High Performance Spark, etc.): > https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > Pronouns: she/her > > > On Sat, Dec 14, 2024 at 1:03 PM Martin Grund <mar...@databricks.com> > wrote: > >> Dongjoon, nobody is saying that RDD should not be part of the public API. >> It is very important to understand the difference here. >> >> I've articulated this before and will try again. It is possible that >> existing workloads require RDDs and these are very much supported by >> setting the spark conf for the API mode. This is similar to the other spark >> confs any deployment of an application sets to be configured accordingly. >> >> The guidance with Spark Connect as a default is to provide a path where >> the new future developers and users of Spark leverage the declarative >> interface by default. >> >> I would really like to look at this proposal as a forward looking >> decision that aims to ease the life of Spark users with better classpath >> isolation, better upgrade behavior and better application integration. The >> goal is to optimize for the new users and workloads that will come over >> time while allowing all existing workloads to run by setting exactly one >> spark conf. >> >> >> On Sat, Dec 14, 2024 at 04:22 Ángel <angel.alvarez.pas...@gmail.com> >> wrote: >> >>> -1 >>> >>> >>> El sáb, 14 dic 2024 a las 1:36, Dongjoon Hyun (<dongjoon.h...@gmail.com>) >>> escribió: >>> >>>> For the RDD part, I also disagree with Martin. >>>> I believe RDD should be supported permanently as the public API. >>>> Otherwise, it would be a surprise to me and my colleagues at least. >>>> >>>> > I would assume that we all agree that >>>> > 99% of the _new_ users in Spark should not try to write code in RDDs. >>>> >>>> According to this long discussion context, >>>> I also decided to switch my vote from +1 to -1 >>>> because it seems too early to make this decision >>>> given the pending `Spark Connect` work and active discussion. >>>> Previously, I was biased only on the SQL part too much. >>>> >>>> As a side note, I hope Apache Spark 4.0.0 release is not going >>>> to be blocked by the `Spark Connect` pending work and decision. >>>> >>>> Dongjoon. 
On Tue, Dec 3, 2024 at 7:51 PM Holden Karau <holden.ka...@gmail.com> wrote:

On Fri, Nov 29, 2024 at 12:24 AM Martin Grund <mar...@databricks.com> wrote:

> At the chance of repeating what Herman said word for word :) I would like to call out the following:
>
> 1. The goal of setting the default is to guide users to use the Spark SQL APIs that have proven themselves over time. We shouldn't underestimate the power of the default. I would assume that we all agree that 99% of the _new_ users in Spark should not try to write code in RDDs.

I would disagree here. Maybe more like 75%.

> 2. Any user, organization, or vendor can leverage *all* of their existing code by simply changing *one* configuration during startup: switching spark.api.mode to classic (e.g., similar to ANSI mode). This means all existing RDD and library code just works fine.
>
> 3. Creating a fractured user experience by using some logic to identify which API mode is used is not ideal. Many of the use cases I've seen that require additional jars (e.g., data sources, drivers) just work fine, because Spark already has the right abstractions. JARs used in the client-side part of the code just work, as Herman said.

Introducing the config flag defaulting to a limited API already introduces a fractured user experience, where an application may fail partway through running.

> 4. Similarly, based on the experience of running Spark Connect in production, the co-existence of workloads running in classic mode and connect mode is working fine.

I still don't like classic mode (maybe "full" and "restricted").

On Fri, Nov 29, 2024 at 3:18 AM Holden Karau <holden.ka...@gmail.com> wrote:

I would switch to +0 if the default of connect was only for apps without any user-provided jars / non-JVM apps.
On Thu, Nov 28, 2024 at 6:11 PM Holden Karau <holden.ka...@gmail.com> wrote:

Given there is no plan to support RDDs, I'll update to -0.9.

On Thu, Nov 28, 2024 at 6:00 PM Herman van Hovell <her...@databricks.com> wrote:

Hi Holden and Mridul,

Just to be clear: what API parity are you expecting here? We have parity for everything that is exposed in org.apache.spark.sql. Connect does not support RDDs, SparkContext, etc. There are currently no plans to support this. We are considering adding a compatibility layer, but that will be limited in scope. From running Connect in production for the last year, we see that most users can migrate their workloads without any problems.

I do want to call out that this proposal is mostly aimed at how new users will interact with Spark. Existing users, when they migrate their application to Spark 4, have to set a conf if it turns out their application is not working. This should be a minor inconvenience compared to the headaches that a new Scala version or other library upgrades can cause.

Since this is a breaking change, I do think this should be done in a major version.

At the risk of repeating the SPIP: using Connect as the default brings a lot to the table (e.g. simplicity, easier upgrades, extensibility, etc.), and I'd urge you to also factor this into your decision making.

Happy Thanksgiving!

Cheers,
Herman

On Thu, Nov 28, 2024 at 8:43 PM Mridul Muralidharan <mri...@gmail.com> wrote:

Hi,

I agree with Holden; I am leaning -1 on the proposal as well. Unlike the removal of deprecated features, which we align on a major version boundary, changing the default is something we can do in a minor version as well, once there is API parity.

Irrespective of which major/minor version we make the switch in, there could be user impact; minimizing this impact would be greatly appreciated by our users.
Regards,
Mridul

On Wed, Nov 27, 2024 at 8:31 PM Holden Karau <holden.ka...@gmail.com> wrote:

-0.5: I don't think this is a good idea for JVM apps until we have API parity. (Binding, but to be clear not a veto.)

On Wed, Nov 27, 2024 at 6:27 PM Xinrong Meng <xinr...@apache.org> wrote:

+1

Thank you Herman!

On Thu, Nov 28, 2024 at 3:37 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

+1

On Wed, Nov 27, 2024 at 09:16 Denny Lee <denny.g....@gmail.com> wrote:

+1 (non-binding)

On Wed, Nov 27, 2024 at 3:07 AM Martin Grund <mar...@databricks.com.invalid> wrote:

As part of the discussion on this topic, I would love to highlight the work the community is currently doing to support SparkML, which is traditionally very RDD-heavy, natively in Spark Connect. Bobby's awesome work shows that, over time, we can extend the features of Spark Connect and support workloads that we previously thought could not be supported easily.

https://github.com/apache/spark/pull/48791

Martin

On Wed, Nov 27, 2024 at 11:42 AM Yang,Jie(INF) <yangji...@baidu.com.invalid> wrote:

+1

-------- Original message --------
From: Hyukjin Kwon <gurwls...@apache.org>
Date: 2024-11-27 08:04:06
Subject: [External Mail] Re: Spark Connect the default API in Spark 4.0
To: Bjørn Jørgensen <bjornjorgen...@gmail.com>
Cc: Herman van Hovell <her...@databricks.com.invalid>; Spark dev list <dev@spark.apache.org>

+1

On Mon, 25 Nov 2024 at 23:33, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:

+1

On Mon, Nov 25, 2024 at 14:48, Herman van Hovell <her...@databricks.com.invalid> wrote:

Hi All,

I would like to start a discussion on "Spark Connect the default API in Spark 4.0".

The rationale for this change is that Spark Connect brings a lot of improvements with respect to simplicity, stability, isolation, upgradability, and extensibility (all detailed in the SPIP).
In a nutshell: we want to introduce a flag, spark.api.mode, that allows a user to choose between classic and connect mode, with connect as the default. A user can easily fall back to classic by setting spark.api.mode to classic.

SPIP: https://docs.google.com/document/d/1C0kuQEliG78HujVwdnSk0wjNwHEDdwo2o8aVq7kbhTo/edit?tab=t.0#heading=h.r2c3xrbiklu3
JIRA: https://issues.apache.org/jira/browse/SPARK-50411

I am looking forward to your feedback!

Cheers,
Herman

--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297
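To ground the classic/connect distinction in Herman's proposal: DataFrame-only code is the "opinionated subset" expected to behave the same under either API mode, while anything touching SparkContext or RDDs is what the classic fallback exists for. A sketch, assuming the spark.api.mode semantics described in this thread (names provisional):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-mode-demo").getOrCreate()

# Declarative DataFrame code: works the same whether the session is
# backed by classic Spark or by a Spark Connect client.
spark.range(100).selectExpr("sum(id) AS total").show()

# RDD-level code: per the thread, Connect does not support RDDs or
# SparkContext, so this line only works when the application runs with
# spark.api.mode set to classic.
total = spark.sparkContext.parallelize(range(100)).sum()
print(total)  # 4950
```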