The higher the level of abstraction, the less control and insight you
typically have into the underlying system's internal workings. If the goal
is to create users rather than developers, Spark Connect is the right API
to achieve that purpose.
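
To make that concrete, here is a minimal sketch of the Spark Connect
client surface (the endpoint and artifact version are assumptions for
illustration; a Connect server must already be running):

    // build.sbt (assumed coordinates):
    // libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.1"
    import org.apache.spark.sql.SparkSession

    // Thin gRPC client: no Spark internals on the application classpath.
    val spark = SparkSession
      .builder()
      .remote("sc://localhost:15002")
      .getOrCreate()

    // The logical plan is built client-side and executed by the server,
    // so the client is largely decoupled from the server's internals.
    spark.range(10).filter("id % 2 = 0").show()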

On Sun, 19 Jan 2025, 13:10, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> I believe that by actively involving the user community, we can create a
> more user-centric and successful path for the future of Spark in this
> respect. At the moment, the discussion is confined to this dev group, but
> we ought to gather feedback from the trenches so we can gain a sense of
> the exit barriers and timelines involved in the evolution of Spark APIs,
> particularly the transition from RDDs to newer APIs like Spark Connect.
> For example, we will need:
>
>    - Clear and comprehensive migration guides and resources to help users
>    transition from older APIs (like RDDs) to newer APIs (like Spark
>    Connect), so that legacy applications can take advantage of the latest
>    features and improvements in Spark.
>
> Regrettably, some users have come to the conclusion that Spark has become
> essentially an ETL tool. I differ from this view: being able to use Spark
> for slicing and dicing data across multiple flavours of storage through
> the available Spark plugins -- without changing much Python or Scala code
> -- is a great feature. Moreover, the inclusion of data science libraries
> in Spark will add further value at the next stage and position Spark as a
> powerful and versatile platform beyond ETL.
>
> HTH
>
> Mich Talebzadeh,
>
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
> View my LinkedIn profile:
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but cannot be guaranteed. It is essential to note that, as with
> any advice: "one test result is worth one-thousand expert opinions"
> (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Sun, 19 Jan 2025 at 03:07, Jules Damji <jules.da...@gmail.com> wrote:
>
>> On 2025/01/18 22:35:59 Mich Talebzadeh wrote:
>> > I think your view highlights the need for a shift towards more stable
>> > and version-independent APIs. Spark Connect, IMO, is a key enabler of
>> > this shift, allowing users and developers to build applications and
>> > libraries that are more resilient to changes in Spark's internals, as
>> > opposed to RDDs.
>> >
>>
>> Yup. The vital phrase here is "stable and version-independent APIs".
>>
>> > As I stated before, maintaining backward compatibility for existing
>> > RDD-based applications and libraries is crucial during this transition,
>> > so the timeframe is another factor for consideration.
>> >
>>
>> This is crucially vital for the long-term transition and, as Holden
>> rightfully points out, the 90/10 issue applies to that 10%.
>>
>> Cheers
>> Jules
>>
>> >
>> > Mich
>> >
>> >
>> > On Sat, 18 Jan 2025 at 22:10, Matei Zaharia <ma...@gmail.com> wrote:
>> >
>> > > We definitely need to move the "advanced" users to stable APIs, such
>> > > as the Spark Connect plugin APIs, if we want Spark to have a good
>> > > future. The RDD API was the wrong abstraction, in my opinion --
>> > > hopefully I can say that, since I worked on it. It was too tightly
>> > > bound to Java types and to internal details. I'd love to see more
>> > > suggestions to help us get away from these low-level Java APIs toward
>> > > things that can automatically be ported forward with new Spark
>> > > versions. Why do platform teams building a new ML library have to use
>> > > internal APIs, for example? They should use the same query plan + UDF
>> > > interface that most user code uses; this would save them a ton of
>> > > maintenance time going forward and help their users benefit from the
>> > > latest changes in Spark (or at least test them) much sooner.
>> > >
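>> > > As a minimal sketch (the library name and function are illustrative,
>> > > not from any real project), a library built on the public DataFrame +
>> > > UDF surface rather than RDD internals can look like this:
>> > >
>> > >     import org.apache.spark.sql.{DataFrame, functions => F}
>> > >
>> > >     // Hypothetical library entry point: plain DataFrame in,
>> > >     // DataFrame out, touching only the public query-plan + UDF API.
>> > >     object ExampleFeatureLib {
>> > >       // A scalar UDF registered through the stable functions.udf API
>> > >       private val normalize = F.udf((x: Double) => x / 100.0)
>> > >
>> > >       def withNormalized(df: DataFrame, colName: String): DataFrame =
>> > >         df.withColumn(colName + "_norm", normalize(F.col(colName)))
>> > >     }
>> > >
>> > > Because nothing here reaches into Spark internals, the same code
>> > > compiles against new Spark versions without changes.
>> > >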
>> > > I'm not sure whether it's clear, but this was absolutely the #1 goal
>> > > of Spark Connect to me: to give Spark users and library developers
>> > > (data sources, algorithms, etc.) the freedom not to worry about Spark
>> > > version updates. Otherwise, all these projects will eventually be
>> > > replaced by platforms that don't require you to worry about versions
>> > > or wait for your platform team to port a ton of jobs and libraries
>> > > over for each release. Even shading only gets you so far, and it
>> > > introduces new problems that other users might not like.
>> > >
>> > > On Jan 18, 2025, at 1:42 PM, Holden Karau <ho...@gmail.com> wrote:
>> > >
>> > > I would say the short answer is "mostly not", and the longer answer
>> > > is that the Connect APIs explicitly do not cover many of what we
>> > > would call "paved paths", because we're more likely to have JAR
>> > > conflicts with advanced users, who are more likely to use some of the
>> > > non-supported APIs. For example, some of our biggest JAR conflicts
>> > > come from other platform teams which build platforms on top of Spark
>> > > (think custom machine learning tools or special streaming stuff).
>> > >
>> > > It's sort of that classic problem of building something for the 90%,
>> > > when the 10% are the ones with the actual issue you're trying to
>> > > avoid.
>> > >
>> > > On Sat, Jan 18, 2025 at 1:26 PM Denny Lee <de...@gmail.com> wrote:
>> > >
>> > >> BTW, one of the many reasons Spark Connect was developed was to
>> > >> simplify this process around shading (i.e., potentially not needing
>> > >> to do it at all). I'm wondering if utilizing Spark Connect could be
>> > >> a potential solution here?
>> > >>
>> > >>
>> > >> On Fri, Jan 17, 2025 at 12:27 Holden Karau <ho...@gmail.com>
>> > >> wrote:
>> > >>
>> > >>> +1, I think this is great. If you've got any shading you'd be open
>> > >>> to upstreaming, I'd be happy to review it.
>> > >>>
>> > >>> Twitter: https://twitter.com/holdenkarau
>> > >>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> > >>> Books (Learning Spark, High Performance Spark, etc.):
>> > >>> https://amzn.to/2MaRAG9
>> > >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> > >>> Pronouns: she/her
>> > >>>
>> > >>>
>> > >>> On Fri, Jan 17, 2025 at 12:25 PM John Zhuge <jz...@apache.org>
>> wrote:
>> > >>>
>> > >>>> Thanks for sharing the insightful context!
>> > >>>>
>> > >>>> On Fri, Jan 17, 2025 at 11:47 AM Regina Lee
>> <re...@linkedin.com.invalid>
>> > >>>> wrote:
>> > >>>>
>> > >>>>> Hi,
>> > >>>>>
>> > >>>>> I'd like to share insights from our Spark team at LinkedIn. We
>> > >>>>> recently moved to a mostly shaded Spark 3 client internally. Our
>> > >>>>> goal was to minimize dependency conflicts that could hinder Spark
>> > >>>>> upgrades, especially given our previous efforts to migrate our
>> > >>>>> users from Spark 2 to Spark 3, and LinkedIn's heavy Scala / Java
>> > >>>>> use cases with complicated dependency trees. We shaded rather
>> > >>>>> aggressively (100+ relocations) given our specific ecosystem
>> > >>>>> needs -- Hadoop 2.10 with no current/planned support for the
>> > >>>>> Spark streaming / connect modules.
>> > >>>>>
>> > >>>>> At a high level, some notable shaded prefixes included org.json,
>> > >>>>> com.google.common / protobuf, org.apache.commons, and org.antlr.
>> > >>>>> Key dependencies *not* shaded were avro, jackson, datanucleus,
>> > >>>>> and the logging / JRE / Scala dependencies (in general, any
>> > >>>>> dependencies exposed in Spark's or other dependencies' public
>> > >>>>> APIs).
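>> > >>>>>
>> > >>>>> As a sketch of what such relocations look like (this uses
>> > >>>>> sbt-assembly syntax and an illustrative shaded prefix; the actual
>> > >>>>> build-tool setup will differ):
>> > >>>>>
>> > >>>>>     // build.sbt -- relocating a few of the prefixes named above
>> > >>>>>     assembly / assemblyShadeRules := Seq(
>> > >>>>>       ShadeRule.rename("org.json.**" -> "shaded.spark.org.json.@1").inAll,
>> > >>>>>       ShadeRule.rename("com.google.common.**" -> "shaded.spark.guava.@1").inAll,
>> > >>>>>       ShadeRule.rename("com.google.protobuf.**" -> "shaded.spark.protobuf.@1").inAll,
>> > >>>>>       ShadeRule.rename("org.apache.commons.**" -> "shaded.spark.commons.@1").inAll,
>> > >>>>>       ShadeRule.rename("org.antlr.**" -> "shaded.spark.antlr.@1").inAll
>> > >>>>>     )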
>> > >>>>>
>> > >>>>> There is an expected one-time cost in onboarding our Spark users
>> > >>>>> to the shaded client. Most issues require importing missing
>> > >>>>> dependencies originally provided by Spark/Hadoop. We are
>> > >>>>> generally in favor of shading more of Spark's dependencies
>> > >>>>> because it has helped reduce developer toil and troubleshooting
>> > >>>>> efforts.
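>> > >>>>>
>> > >>>>> For example, a user build that previously relied on commons-lang3
>> > >>>>> leaking from Spark's classpath would now declare it explicitly
>> > >>>>> (coordinates and version illustrative):
>> > >>>>>
>> > >>>>>     // build.sbt
>> > >>>>>     libraryDependencies += "org.apache.commons" % "commons-lang3" % "3.12.0"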
>> > >>>>>
>> > >>>>> Thanks,
>> > >>>>> Regina
>> > >>>>>
>> > >>>>> On 2024/12/07 15:30:20 Mich Talebzadeh wrote:
>> > >>>>> > General comment without specifics. I think shading should be
>> > >>>>> > used *on a case-by-case basis*, when the benefits outweigh the
>> > >>>>> > drawbacks. How about exploring alternatives such as
>> > >>>>> > modularization, dependency management, or careful dependency
>> > >>>>> > selection before resorting to shading? My point is that shading
>> > >>>>> > will introduce more debugging and testing, as packages will be
>> > >>>>> > renamed, impacting flexibility. Case in point: things like unit
>> > >>>>> > and integration tests may need adjustments to account for the
>> > >>>>> > renamed packages.
>> > >>>>> >
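>> > >>>>> > For instance, a single pinned version (illustrative) can
>> > >>>>> > sometimes resolve a conflict without renaming any packages:
>> > >>>>> >
>> > >>>>> >     // build.sbt -- force one Guava version instead of shading
>> > >>>>> >     dependencyOverrides += "com.google.guava" % "guava" % "32.1.3-jre"
>> > >>>>> >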
>> > >>>>> > HTH
>> > >>>>> >
>> > >>>>> > Mich Talebzadeh,
>> > >>>>> >
>> > >>>>> > Architect | Data Science | Financial Crime | GDPR & Compliance
>> > >>>>> > Specialist
>> > >>>>> > PhD, Imperial College London
>> > >>>>> > London, United Kingdom
>> > >>>>> >
>> > >>>>> > View my LinkedIn profile:
>> > >>>>> > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>> > >>>>> >
>> > >>>>> > https://en.everybodywiki.com/Mich_Talebzadeh
>> > >>>>> >
>> > >>>>> > *Disclaimer:* The information provided is correct to the best
>> > >>>>> > of my knowledge but cannot be guaranteed. It is essential to
>> > >>>>> > note that, as with any advice: "one test result is worth
>> > >>>>> > one-thousand expert opinions" (Wernher von Braun
>> > >>>>> > <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>> > >>>>> >
>> > >>>>> >
>> > >>>>> > On Sat, 7 Dec 2024 at 06:21, Holden Karau <ho...@gmail.com>
>> wrote:
>> > >>>>> >
>> > >>>>> > > Hi Y'all,
>> > >>>>> > >
>> > >>>>> > > As we're getting closer to 4.0, I was thinking now is a good
>> > >>>>> > > time for us to try and reduce the class path we expose for
>> > >>>>> > > JVM users. Are there any common classes/packages folks would
>> > >>>>> > > like to see shaded?
>> > >>>>> > >
>> > >>>>> > > Cheers,
>> > >>>>> > >
>> > >>>>> > > Holden :)
>> > >>>>> > >
>> > >>>>> > > --
>> > >>>>> > > Twitter: https://twitter.com/holdenkarau
>> > >>>>> > > Fight Health Insurance: https://www.fighthealthinsurance.com/
>> > >>>>> > > Books (Learning Spark, High Performance Spark, etc.):
>> > >>>>> > > https://amzn.to/2MaRAG9
>> > >>>>> > > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> > >>>>> > > Pronouns: she/her
>> > >>>>> > >
>> > >>>>> >
>> > >>>>>
>> > >>>>
>> > >>>>
>> > >>>> --
>> > >>>> John Zhuge
>> > >>>>
>> > >>>
>> > >
>> > > --
>> > > Twitter: https://twitter.com/holdenkarau
>> > > Fight Health Insurance: https://www.fighthealthinsurance.com/
>> > > Books (Learning Spark, High Performance Spark, etc.):
>> > > https://amzn.to/2MaRAG9
>> > > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> > > Pronouns: she/her
>> > >
>> > >
>> > >
>> >
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
