I think what people are saying rather consistently is that we want the RDD APIs to be available by default, without requiring additional config changes. RDDs are a core API that is frequently used.
Personally, I think this proposal would have had more success if it had been introduced for new languages only, rather than breaking existing RDD users. But now that it has -1s, including from PMC members, I think this proposal should be dropped; if it's something we as a project want to consider in the future, it will need some more work to build consensus.

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Sat, Dec 14, 2024 at 1:03 PM Martin Grund <mar...@databricks.com> wrote:

> Dongjoon, nobody is saying that RDDs should not be part of the public API. It is very important to understand the difference here.
>
> I've articulated this before and will try again. It is possible that existing workloads require RDDs, and these are very much supported by setting the Spark conf for the API mode. This is similar to any other Spark conf that a deployment sets to configure an application.
>
> The guidance with Spark Connect as the default is to provide a path where future developers and users of Spark leverage the declarative interface by default.
>
> I would really like to look at this proposal as a forward-looking decision that aims to ease the life of Spark users with better classpath isolation, better upgrade behavior, and better application integration. The goal is to optimize for the new users and workloads that will come over time, while allowing all existing workloads to run by setting exactly one Spark conf.
>
> On Sat, Dec 14, 2024 at 04:22 Ángel <angel.alvarez.pas...@gmail.com> wrote:
>
>> -1
>>
>> On Sat, 14 Dec 2024 at 1:36, Dongjoon Hyun (<dongjoon.h...@gmail.com>) wrote:
>>
>>> For the RDD part, I also disagree with Martin.
>>> I believe RDDs should be supported permanently as a public API.
>>> Otherwise, it would be a surprise to me and my colleagues, at the very least.
>>>
>>> > I would assume that we all agree that
>>> > 99% of the _new_ users in Spark should not try to write code in RDDs.
>>>
>>> Given this long discussion, I have also decided to switch my vote from +1 to -1, because it seems too early to make this decision given the pending `Spark Connect` work and the active discussion. Previously, I was biased too much toward only the SQL part.
>>>
>>> As a side note, I hope the Apache Spark 4.0.0 release is not going to be blocked by the pending `Spark Connect` work and decision.
>>>
>>> Dongjoon.
>>>
>>> On Tue, Dec 3, 2024 at 7:51 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> Pronouns: she/her
>>>>
>>>> On Fri, Nov 29, 2024 at 12:24 AM Martin Grund <mar...@databricks.com> wrote:
>>>>
>>>>> At the risk of repeating what Herman said word for word :) I would like to call out the following:
>>>>>
>>>>> 1. The goal of setting the default is to guide users to the Spark SQL APIs that have proven themselves over time.
>>>>> We shouldn't underestimate the power of the default. I would assume that we all agree that 99% of the _new_ users in Spark should not try to write code in RDDs.
>>>>
>>>> I would disagree here. Maybe more like 75%.
>>>>
>>>>> 2. Any user, organization, or vendor can leverage *all* of their existing code by simply changing *one* configuration during startup: switching spark.api.mode to classic (e.g., similar to ANSI mode). This means all existing RDD and library code just works.
>>>>>
>>>>> 3. Creating a fractured user experience by using some logic to identify which API mode is used is not ideal. Many of the use cases I've seen that require additional jars (e.g., data sources, drivers) just work, because Spark already has the right abstractions. JARs used in the client-side part of the code just work, as Herman said.
>>>>
>>>> Introducing a config flag that defaults to a limited API already introduces a fractured user experience, where an application may fail partway through running.
>>>>
>>>>> 4. Similarly, based on the experience of running Spark Connect in production, the coexistence of workloads running in classic mode and connect mode works fine.
>>>>
>>>> I still don't like "classic" mode (maybe "full" and "restricted").
>>>>
>>>>> On Fri, Nov 29, 2024 at 3:18 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>
>>>>>> I would switch to +0 if connect were the default only for apps without any user-provided jars and for non-JVM apps.
>>>>>>
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>> Pronouns: she/her
>>>>>>
>>>>>> On Thu, Nov 28, 2024 at 6:11 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>>
>>>>>>> Given there is no plan to support RDDs, I'll update to -0.9.
>>>>>>>
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>> Pronouns: she/her
>>>>>>>
>>>>>>> On Thu, Nov 28, 2024 at 6:00 PM Herman van Hovell <her...@databricks.com> wrote:
>>>>>>>
>>>>>>>> Hi Holden and Mridul,
>>>>>>>>
>>>>>>>> Just to be clear: what API parity are you expecting here? We have parity for everything that is exposed in org.apache.spark.sql. Connect does not support RDDs, SparkContext, etc., and there are currently no plans to support them. We are considering adding a compatibility layer, but that will be limited in scope. From running Connect in production for the last year, we see that most users can migrate their workloads without any problems.
>>>>>>>>
>>>>>>>> I do want to call out that this proposal is mostly aimed at how new users will interact with Spark.
>>>>>>>> Existing users, when they migrate their application to Spark 4, only have to set a conf if it turns out their application does not work. This should be a minor inconvenience compared to the headaches that a new Scala version or other library upgrades can cause.
>>>>>>>>
>>>>>>>> Since this is a breaking change, I do think it should be done in a major version.
>>>>>>>>
>>>>>>>> At the risk of repeating the SPIP: using Connect as the default brings a lot to the table (e.g., simplicity, easier upgrades, extensibility), and I'd urge you to also factor this into your decision making.
>>>>>>>>
>>>>>>>> Happy Thanksgiving!
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Herman
>>>>>>>>
>>>>>>>> On Thu, Nov 28, 2024 at 8:43 PM Mridul Muralidharan <mri...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I agree with Holden; I am leaning -1 on the proposal as well.
>>>>>>>>> Unlike the removal of deprecated features, which we align on a major version boundary, changing the default is something we can do in a minor version as well, once there is API parity.
>>>>>>>>>
>>>>>>>>> Irrespective of which major/minor version we make the switch in, there could be user impact; minimizing this impact would be greatly appreciated by our users.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Mridul
>>>>>>>>>
>>>>>>>>> On Wed, Nov 27, 2024 at 8:31 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> -0.5: I don't think this is a good idea for JVM apps until we have API parity. (Binding, but to be clear, not a veto.)
>>>>>>>>>>
>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>>>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>>>>> Pronouns: she/her
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 27, 2024 at 6:27 PM Xinrong Meng <xinr...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>> Thank you, Herman!
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 28, 2024 at 3:37 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 27, 2024 at 09:16 Denny Lee <denny.g....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 27, 2024 at 3:07 AM Martin Grund <mar...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> As part of the discussion on this topic, I would love to highlight the work the community is currently doing to support SparkML, which is traditionally very RDD-heavy, natively in Spark Connect. Bobby's awesome work shows that, over time, we can extend the features of Spark Connect and support workloads that we previously thought could not be supported easily.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://github.com/apache/spark/pull/48791
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Martin
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 27, 2024 at 11:42 AM Yang,Jie(INF) <yangji...@baidu.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -------- Original Message --------
>>>>>>>>>>>>>>> From: Hyukjin Kwon <gurwls...@apache.org>
>>>>>>>>>>>>>>> Date: 2024-11-27 08:04:06
>>>>>>>>>>>>>>> Subject: [External] Re: Spark Connect the default API in Spark 4.0
>>>>>>>>>>>>>>> To: Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>>>>>>>>>>>>>> Cc: Herman van Hovell <her...@databricks.com.invalid>; Spark dev list <dev@spark.apache.org>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, 25 Nov 2024 at 23:33, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, 25 Nov 2024 at 14:48, Herman van Hovell <her...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would like to start a discussion on "Spark Connect the default API in Spark 4.0".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The rationale for this change is that Spark Connect brings a lot of improvements with respect to simplicity, stability, isolation, upgradability, and extensibility (all detailed in the SPIP). In a nutshell: we want to introduce a flag, spark.api.mode, that allows a user to choose between classic and connect mode, the default being connect. A user can easily fall back to classic by setting spark.api.mode to classic.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> SPIP: https://docs.google.com/document/d/1C0kuQEliG78HujVwdnSk0wjNwHEDdwo2o8aVq7kbhTo/edit?tab=t.0#heading=h.r2c3xrbiklu3
>>>>>>>>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-50411
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am looking forward to your feedback!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Herman
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Bjørn Jørgensen
>>>>>>>>>>>>>>>> Vestre Aspehaug 4, 6010 Ålesund, Norge
>>>>>>>>>>>>>>>> +47 480 94 297
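
For concreteness, the one-conf fallback discussed throughout the thread would look roughly like the following Scala sketch. It assumes the spark.api.mode setting from the SPIP is honored by the session builder as described there; the application name and the small RDD word count are only hypothetical stand-ins for an existing classic-API workload.

    import org.apache.spark.sql.SparkSession

    object ClassicModeFallback {
      def main(args: Array[String]): Unit = {
        // Opt out of the proposed Spark Connect default and keep the full
        // JVM API surface (SparkContext, RDDs) for this application.
        // The exact effect of spark.api.mode is per the SPIP, not verified here.
        val spark = SparkSession.builder()
          .appName("existing-rdd-workload")
          .config("spark.api.mode", "classic")
          .getOrCreate()

        // Existing RDD code that connect mode does not support runs unchanged.
        val counts = spark.sparkContext
          .parallelize(Seq("spark", "connect", "spark"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.collect().foreach(println)
        spark.stop()
      }
    }

The same setting could presumably also be supplied without touching code, e.g. spark-submit --conf spark.api.mode=classic, which is what would make this a one-line change for existing deployments.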