-1
On Sat, Dec 14, 2024 at 1:36 AM, Dongjoon Hyun (<dongjoon.h...@gmail.com>) wrote:

> For the RDD part, I also disagree with Martin.
> I believe RDD should be supported permanently as the public API.
> Otherwise, it would be a surprise to me and my colleagues at least.
>
> > I would assume that we all agree that
> > 99% of the _new_ users in Spark should not try to write code in RDDs.
>
> Given this long discussion context,
> I have also decided to switch my vote from +1 to -1,
> because it seems too early to make this decision
> given the pending `Spark Connect` work and the active discussion.
> Previously, I was too biased toward the SQL part.
>
> As a side note, I hope the Apache Spark 4.0.0 release is not going
> to be blocked by the pending `Spark Connect` work and decision.
>
> Dongjoon.
>
> On Tue, Dec 3, 2024 at 7:51 PM Holden Karau <holden.ka...@gmail.com> wrote:
>
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>> On Fri, Nov 29, 2024 at 12:24 AM Martin Grund <mar...@databricks.com> wrote:
>>
>>> At the risk of repeating what Herman said word for word :) I would like to call out the following:
>>>
>>> 1. The goal of setting the default is to guide users toward the Spark SQL APIs that have proven themselves over time. We shouldn't underestimate the power of the default. I would assume that we all agree that 99% of the _new_ users in Spark should not try to write code in RDDs.
>>
>> I would disagree here. Maybe more like 75%.
>>
>>> 2. Any user, organization, or vendor can leverage *all* of their existing code by simply changing *one* configuration during startup: switching spark.api.mode to classic (e.g., similar to ANSI mode). This means all existing RDD and library code just works.
>>>
>>> 3. Creating a fractured user experience by using some logic to identify which API mode is used is not ideal. Many of the use cases I've seen that require additional jars (e.g., data sources, drivers) just work, because Spark already has the right abstractions. JARs used in the client-side part of the code also just work, as Herman said.
>>
>> Introducing a config flag that defaults to a limited API already introduces a fractured user experience, where an application may fail partway through running.
>>
>>> 4. Similarly, based on the experience of running Spark Connect in production, the co-existence of workloads running in classic mode and connect mode is working fine.
>>
>> I still don't like "classic" as a mode name (maybe "full" and "restricted").
>>
>>> On Fri, Nov 29, 2024 at 3:18 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>>
>>>> I would switch to +0 if connect were the default only for apps without any user-provided jars / non-JVM apps.
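The RDD-versus-SQL contrast debated above can be illustrated without Spark at all. Below is a plain-Python sketch (not PySpark; list comprehensions and a dict stand in for flatMap/map/reduceByKey) of why the imperative RDD style asks more of new users than a declarative aggregation, using word count as the example:

```python
from collections import Counter

lines = ["spark connect", "spark classic"]

# RDD style: the user spells out every physical step of the computation.
words = [w for line in lines for w in line.split()]  # ~ flatMap(_.split(" "))
pairs = [(w, 1) for w in words]                      # ~ map(w => (w, 1))
counts = {}
for w, n in pairs:                                   # ~ reduceByKey(_ + _)
    counts[w] = counts.get(w, 0) + n

# DataFrame/SQL style: declare the aggregation and let the engine plan the
# steps (Counter plays the role of groupBy("word").count() here).
declarative = dict(Counter(words))

assert counts == declarative  # same result, very different user burden
```

In real Spark the declarative form additionally lets the optimizer pick the execution strategy, which is part of the argument for steering new users toward the SQL APIs.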
>>>>
>>>> On Thu, Nov 28, 2024 at 6:11 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>
>>>>> Given there is no plan to support RDDs, I'll update to -0.9.
>>>>>
>>>>> On Thu, Nov 28, 2024 at 6:00 PM Herman van Hovell <her...@databricks.com> wrote:
>>>>>
>>>>>> Hi Holden and Mridul,
>>>>>>
>>>>>> Just to be clear: what API parity are you expecting here? We have parity for everything that is exposed in org.apache.spark.sql. Connect does not support RDDs, SparkContext, etc. There are currently no plans to support them. We are considering adding a compatibility layer, but that will be limited in scope. From running Connect in production for the last year, we see that most users can migrate their workloads without any problems.
>>>>>>
>>>>>> I do want to call out that this proposal is mostly aimed at how new users will interact with Spark. Existing users, when they migrate their application to Spark 4, only have to set a conf if it turns out their application is not working. This should be a minor inconvenience compared to the headaches that a new Scala version or other library upgrades can cause.
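The fallback discussed above is a single configuration change. Assuming the flag lands as spelled in the SPIP (spark.api.mode with values connect/classic; the exact spelling could still change before release), it might look like:

```shell
# Per-application fallback to the classic JVM-backed API
# (hypothetical invocation based on the SPIP; the proposed default is "connect"):
./bin/spark-submit --conf spark.api.mode=classic my_app.py

# Or cluster-wide, by adding a line to conf/spark-defaults.conf:
#   spark.api.mode   classic
```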
>>>>>>
>>>>>> Since this is a breaking change, I do think this should be done in a major version.
>>>>>>
>>>>>> At the risk of repeating the SPIP: using Connect as the default brings a lot to the table (e.g. simplicity, easier upgrades, extensibility, etc.). I'd urge you to also factor this into your decision making.
>>>>>>
>>>>>> Happy Thanksgiving!
>>>>>>
>>>>>> Cheers,
>>>>>> Herman
>>>>>>
>>>>>> On Thu, Nov 28, 2024 at 8:43 PM Mridul Muralidharan <mri...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I agree with Holden; I am leaning -1 on the proposal as well. Unlike the removal of deprecated features, which we align on a major version boundary, changing the default is something we can do in a minor version as well - once there is API parity.
>>>>>>>
>>>>>>> Irrespective of which major/minor version we make the switch in, there could be user impact; minimizing this impact would be greatly appreciated by our users.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>> On Wed, Nov 27, 2024 at 8:31 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>>>
>>>>>>>> -0.5: I don't think this is a good idea for JVM apps until we have API parity. (Binding, but to be clear, not a veto.)
>>>>>>>>
>>>>>>>> On Wed, Nov 27, 2024 at 6:27 PM Xinrong Meng <xinr...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> Thank you Herman!
>>>>>>>>>
>>>>>>>>> On Thu, Nov 28, 2024 at 3:37 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 27, 2024 at 09:16 Denny Lee <denny.g....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 27, 2024 at 3:07 AM Martin Grund <mar...@databricks.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> As part of the discussion on this topic, I would love to highlight the work the community is currently doing to support SparkML, which is traditionally very RDD-heavy, natively in Spark Connect. Bobby's awesome work shows that, over time, we can extend the features of Spark Connect and support workloads that we previously thought could not be supported easily.
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/apache/spark/pull/48791
>>>>>>>>>>>>
>>>>>>>>>>>> Martin
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 27, 2024 at 11:42 AM Yang,Jie(INF) <yangji...@baidu.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>> -------- Original message --------
>>>>>>>>>>>>> From: Hyukjin Kwon <gurwls...@apache.org>
>>>>>>>>>>>>> Date: 2024-11-27 08:04:06
>>>>>>>>>>>>> Subject: [External Mail] Re: Spark Connect the default API in Spark 4.0
>>>>>>>>>>>>> To: Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>>>>>>>>>>>> Cc: Herman van Hovell <her...@databricks.com.invalid>; Spark dev list <dev@spark.apache.org>
>>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 25 Nov 2024 at 23:33, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Nov 25, 2024 at 14:48, Herman van Hovell <her...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would like to start a discussion on "Spark Connect the default API in Spark 4.0".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The rationale for this change is that Spark Connect brings a lot of improvements with respect to simplicity, stability, isolation, upgradability, and extensibility (all detailed in the SPIP). In a nutshell: we want to introduce a flag, spark.api.mode, that allows a user to choose between classic and connect mode, the default being connect. A user can easily fall back to classic by setting spark.api.mode to classic.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> SPIP: https://docs.google.com/document/d/1C0kuQEliG78HujVwdnSk0wjNwHEDdwo2o8aVq7kbhTo/edit?tab=t.0#heading=h.r2c3xrbiklu3
>>>>>>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-50411
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am looking forward to your feedback!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Herman
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Bjørn Jørgensen
>>>>>>>>>>>>>> Vestre Aspehaug 4, 6010 Ålesund, Norge
>>>>>>>>>>>>>> +47 480 94 297
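For readers following the thread: "connect mode" can already be tried against a current Spark distribution, independent of the proposed spark.api.mode default. A sketch (script name and default gRPC port 15002 as shipped in recent releases; Spark 3.4/3.5 may additionally need a --packages flag for the connect server plugin):

```shell
# Start a Spark Connect server from an unpacked Spark distribution
# (listens on gRPC port 15002 by default):
./sbin/start-connect-server.sh

# Attach a thin PySpark client over gRPC; the client process needs
# no local JVM, only the pyspark-connect client libraries:
./bin/pyspark --remote "sc://localhost:15002"
```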