RE: Re: Increasing Shading & Relocating for 4.0

Jules Damji Sat, 18 Jan 2025 19:07:17 -0800

On 2025/01/18 22:35:59 Mich Talebzadeh wrote:
> I think your view highlights the need for a shift towards more stable and
> version-independent APIs. Spark Connect IMO is a key enabler of this shift,
> allowing users and developers to build applications and libraries that are
> more resilient to changes in Spark's internals as opposed to RDDs.
>


Yup. The vital phrase here is “stable and version-independent APIs”.

> As I stated before, maintaining backward compatibility for the existing
> RDD-based applications and libraries is crucial during this transition so
> timeframe is another factor for consideration.
> 

This is vital for the long-term transition, and Holden rightly points out
the 90/10 issue: it is that 10% who run into the problem.

Cheers
Jules

> 
> Mich
> 
> 
> On Sat, 18 Jan 2025 at 22:10, Matei Zaharia <[email protected]> wrote:
> 
> > We definitely need to move the “advanced” users to stable APIs if we want
> > Spark to have a good future, such as the Spark Connect plugin APIs. The RDD
> > API was the wrong abstraction in my opinion — hopefully I can say that
> > since I worked on it. It was too tightly bound to Java types and to
> > internal details. I’d love to see more suggestions to help us get away from
> > these low-level Java APIs toward things that automatically can be ported
> > forward with new Spark versions. Why do platform teams building a new ML
> > library have to use internal APIs for example? They should use the same
> > query plan + UDF interface that most user code uses, and this would save
> > them a ton of maintenance time going forward and help their users benefit
> > from the latest changes in Spark (or at least test them) much sooner.
> >
> > I’m not sure whether it’s clear, but this was absolutely the #1 goal of
> > Spark Connect to me — it was to give Spark users and library developers
> > (data sources, algorithms, etc) the freedom not to worry about Spark
> > version updates. Otherwise, all these projects will eventually be replaced
> > by platforms that don’t require you to worry about versions or wait for
> > your platform team to port a ton of jobs and libraries over for each
> > release. Even shading only gets you so far, and it introduces new problems
> > that other users might not like.
> >
> > On Jan 18, 2025, at 1:42 PM, Holden Karau <[email protected]> wrote:
> >
> > I would say the short answer is "mostly not" and the longer answer is that
> > the connect APIs are explicitly not covering many, what we would call,
> > "paved paths." Because we're more likely to have JAR conflicts with
> > advanced users who are more likely to use some of the non-supported APIs.
> > For example, some of our biggest JAR conflicts come from other platform
> > teams which build platforms on top of Spark (thinking custom machine
> > learning tools or special streaming stuff).
> >
> > It's sort of that classic problem of building something for the 90% but
> > the 10% are the ones with the actual issue you're trying to avoid.
> >
> > On Sat, Jan 18, 2025 at 1:26 PM Denny Lee <[email protected]> wrote:
> >
> >> BTW, one of many reasons Spark Connect was developed was to potentially
> >> simplify this process around shading (i.e. not needing to do it). I’m
> >> wondering if utilizing Spark Connect could be a potential solution here?
> >>
> >>
> >> On Fri, Jan 17, 2025 at 12:27 Holden Karau <[email protected]>
> >> wrote:
> >>
> >>> +1 I think this is great. If you’ve got any shading you’d be open to
> >>> upstreaming I’d be happy to review it.
> >>>
> >>> Twitter: https://twitter.com/holdenkarau
> >>> Fight Health Insurance: https://www.fighthealthinsurance.com/
> >>> <https://www.fighthealthinsurance.com/?q=hk_email>
> >>> Books (Learning Spark, High Performance Spark, etc.):
> >>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >>> Pronouns: she/her
> >>>
> >>>
> >>> On Fri, Jan 17, 2025 at 12:25 PM John Zhuge <[email protected]> wrote:
> >>>
> >>>> Thanks for sharing the insightful context!
> >>>>
> >>>> On Fri, Jan 17, 2025 at 11:47 AM Regina Lee <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I’d like to share insights from our Spark team at LinkedIn. We
> >>>>> recently moved to a mostly shaded Spark 3 client internally. Our goal
> >>>>> was to minimize dependency conflicts that could hinder Spark upgrades,
> >>>>> especially given our previous efforts to migrate our users from Spark 2
> >>>>> to Spark 3, and LinkedIn’s heavy Scala / Java use cases with complicated
> >>>>> dependency trees. We shaded rather aggressively (100+ relocations) given
> >>>>> our specific ecosystem needs – Hadoop 2.10 with no current/planned
> >>>>> support for Spark streaming / connect modules.
> >>>>>
> >>>>> At a high level, some notable shaded prefixes included org.json,
> >>>>> com.google.common / protobuf, org.apache.commons, and org.antlr. Key
> >>>>> dependencies *not* shaded were avro, jackson, datanucleus, logging /
> >>>>> JRE / scala dependencies (in general, any dependencies exposed in
> >>>>> Spark’s / other dependencies’ public APIs).
> >>>>>
> >>>>> There is an expected one-time cost in onboarding our Spark users to
> >>>>> the shaded client. Most issues require importing missing dependencies
> >>>>> originally provided by Spark/Hadoop. We are generally in favor of
> >>>>> shading more of Spark’s dependencies because it has helped reduce
> >>>>> developer toil and troubleshooting efforts.
> >>>>>
> >>>>> Thanks,
> >>>>> Regina
> >>>>>
> >>>>> On 2024/12/07 15:30:20 Mich Talebzadeh wrote:
> >>>>> > General comment without specifics. I think shading should be used *on a
> >>>>> > case by case basis* when the benefits outweigh the drawbacks. How about
> >>>>> > exploring alternatives such as modularization, dependency management, or
> >>>>> > careful dependency selection, before resorting to shading? My point is that
> >>>>> > shading will introduce more debugging and testing as packages will be
> >>>>> > renamed, impacting flexibility. Case in point, things like unit and
> >>>>> > integration tests may need adjustments to account for the renamed packages.
> >>>>> >
> >>>>> > HTH
> >>>>> >
> >>>>> > Mich Talebzadeh,
> >>>>> >
> >>>>> > Architect | Data Science | Financial Crime | GDPR & Compliance Specialist
> >>>>> > PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
> >>>>> > London <https://en.wikipedia.org/wiki/Imperial_College_London>
> >>>>> > London, United Kingdom
> >>>>> >
> >>>>> >
> >>>>> >    view my Linkedin profile
> >>>>> > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> >>>>> >
> >>>>> >
> >>>>> >  https://en.everybodywiki.com/Mich_Talebzadeh
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > *Disclaimer:* The information provided is correct to the best of my
> >>>>> > knowledge but of course cannot be guaranteed. It is essential to note
> >>>>> > that, as with any advice, quote "one test result is worth one-thousand
> >>>>> > expert opinions (Wernher von Braun
> >>>>> > <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
> >>>>> >
> >>>>> >
> >>>>> > On Sat, 7 Dec 2024 at 06:21, Holden Karau <[email protected]> wrote:
> >>>>> >
> >>>>> > > Hi Y'all,
> >>>>> > >
> >>>>> > > As we're getting closer to 4.0 I was thinking now is a good time for us to
> >>>>> > > try and reduce the class path we expose for JVM users. Are there any common
> >>>>> > > classes/packages folks would like to see shaded?
> >>>>> > >
> >>>>> > > Cheers,
> >>>>> > >
> >>>>> > > Holden :)
> >>>>> > >
> >>>>> >
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> John Zhuge
> >>>>
> >>>
> >
> >
> 
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]
