Yup, it will definitely take a while, but I’d love to start tracking down the 
things that prevent people from moving (the RDD API is one, but I’m worried there 
are also other internal hooks), and also start encouraging library and plugin 
developers to use more forward-compatible APIs. Hopefully we can get there. 
Having Spark Connect used by default in at least some languages, or recommended 
for use cases like this, could help with that.
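
As a rough sketch of what I mean (the endpoint and names below are just 
placeholders, not a recommendation for any particular setup), code written 
against the Spark Connect client depends only on the client API rather than 
on Spark's internal jars:

  // Scala sketch using the Spark Connect client (spark-connect-client-jvm).
  import org.apache.spark.sql.SparkSession

  // The client talks to a remote Spark server over gRPC; the server can be
  // upgraded independently of this application's classpath.
  val spark = SparkSession.builder()
    .remote("sc://localhost:15002") // placeholder endpoint
    .getOrCreate()

  spark.range(10).selectExpr("id * 2 AS doubled").show()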

> On Jan 18, 2025, at 2:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> I think your view highlights the need for a shift towards more stable and 
> version-independent APIs. Spark Connect IMO is a key enabler of this shift, 
> allowing users and developers to build applications and libraries that are 
> more resilient to changes in Spark's internals than RDD-based code is.
> 
> As I stated before, maintaining backward compatibility for existing 
> RDD-based applications and libraries is crucial during this transition, so 
> the timeframe is another factor to consider.
> 
> HTH,
> 
> Mich
> 
> 
> On Sat, 18 Jan 2025 at 22:10, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> We definitely need to move the “advanced” users to stable APIs, such as the 
>> Spark Connect plugin APIs, if we want Spark to have a good future. The RDD 
>> API was the wrong abstraction in my opinion (hopefully I can say that since 
>> I worked on it): it was too tightly bound to Java types and to internal 
>> details. I’d love to see more suggestions to help us get away from these 
>> low-level Java APIs toward things that can automatically be ported forward 
>> to new Spark versions. Why, for example, do platform teams building a new ML 
>> library have to use internal APIs? They should use the same query plan + 
>> UDF interface that most user code uses; this would save them a ton of 
>> maintenance time going forward and help their users benefit from the latest 
>> changes in Spark (or at least test them) much sooner.
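>> 
>> To make that concrete, here is a throwaway sketch (the names are made up and 
>> not from any real library) of a library helper that sticks to the public 
>> DataFrame + UDF surface instead of internal plan or RDD hooks:
>> 
>>   import org.apache.spark.sql.DataFrame
>>   import org.apache.spark.sql.functions.{col, lit, udf}
>> 
>>   // Hypothetical ML-library entry point: only public APIs, so it doesn't
>>   // need to be rebuilt against each Spark release's internals.
>>   object FeatureScaling {
>>     private val scale = udf((x: Double, factor: Double) => x * factor)
>> 
>>     def withScaled(df: DataFrame, column: String, factor: Double): DataFrame =
>>       df.withColumn(column + "_scaled", scale(col(column), lit(factor)))
>>   }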
>> 
>> I’m not sure whether it’s clear, but this was absolutely the #1 goal of 
>> Spark Connect to me — it was to give Spark users and library developers 
>> (data sources, algorithms, etc) the freedom not to worry about Spark version 
>> updates. Otherwise, all these projects will eventually be replaced by 
>> platforms that don’t require you to worry about versions or wait for your 
>> platform team to port a ton of jobs and libraries over for each release. 
>> Even shading only gets you so far, and it introduces new problems that other 
>> users might not like.
>> 
>>> On Jan 18, 2025, at 1:42 PM, Holden Karau <holden.ka...@gmail.com> wrote:
>>> 
>>> I would say the short answer is "mostly not," and the longer answer is that 
>>> the Connect APIs explicitly don't cover many of what we would call "paved 
>>> paths," while we're more likely to have JAR conflicts with the advanced 
>>> users who are more likely to use some of the non-supported APIs. For 
>>> example, some of our biggest JAR conflicts come from other platform teams 
>>> that build platforms on top of Spark (think custom machine learning tools 
>>> or special streaming stuff).
>>> 
>>> It's sort of the classic problem of building something for the 90%, when the 
>>> 10% are the ones with the actual issue you're trying to avoid.
>>> 
>>> On Sat, Jan 18, 2025 at 1:26 PM Denny Lee <denny.g....@gmail.com> wrote:
>>>> BTW, one of the many reasons Spark Connect was developed was to potentially 
>>>> simplify this process around shading (i.e., not needing to do it at all). 
>>>> I'm wondering whether utilizing Spark Connect could be a potential solution 
>>>> here?
>>>> 
>>>> 
>>>> On Fri, Jan 17, 2025 at 12:27 Holden Karau <holden.ka...@gmail.com> wrote:
>>>>> +1, I think this is great. If you’ve got any shading you’d be open to 
>>>>> upstreaming, I’d be happy to review it.
>>>>> 
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>> Pronouns: she/her
>>>>> 
>>>>> 
>>>>> On Fri, Jan 17, 2025 at 12:25 PM John Zhuge <jzh...@apache.org> wrote:
>>>>>> Thanks for sharing the insightful context!
>>>>>> 
>>>>>> On Fri, Jan 17, 2025 at 11:47 AM Regina Lee <re...@linkedin.com.invalid> 
>>>>>> wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I’d like to share insights from our Spark team at LinkedIn. We recently 
>>>>>>> moved to a mostly shaded Spark 3 client internally. Our goal was to 
>>>>>>> minimize dependency conflicts that could hinder Spark upgrades, 
>>>>>>> especially given our previous efforts to migrate our users from Spark 2 
>>>>>>> to Spark 3, and LinkedIn’s heavy Scala / Java use cases with 
>>>>>>> complicated dependency trees. We shaded rather aggressively (100+ 
>>>>>>> relocations) given our specific ecosystem needs: Hadoop 2.10, with no 
>>>>>>> current or planned support for the Spark Streaming / Connect modules.
>>>>>>> 
>>>>>>> At a high level, some notable shaded prefixes included org.json, 
>>>>>>> com.google.common / protobuf, org.apache.commons, and org.antlr. Key 
>>>>>>> dependencies left unshaded were avro, jackson, datanucleus, and the 
>>>>>>> logging / JRE / Scala dependencies (in general, any dependency exposed 
>>>>>>> in Spark’s or other dependencies’ public APIs).
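>>>>>>> 
>>>>>>> For illustration only (this isn’t our actual build file, and the target 
>>>>>>> prefix is just a placeholder), relocations of that kind look roughly 
>>>>>>> like this with sbt-assembly; the maven-shade-plugin relocation config 
>>>>>>> is the equivalent on the Maven side:
>>>>>>> 
>>>>>>>   // build.sbt: rename relocated packages inside every dependency jar
>>>>>>>   assembly / assemblyShadeRules := Seq(
>>>>>>>     ShadeRule.rename("com.google.common.**"   -> "shadedeps.guava.@1").inAll,
>>>>>>>     ShadeRule.rename("com.google.protobuf.**" -> "shadedeps.protobuf.@1").inAll,
>>>>>>>     ShadeRule.rename("org.apache.commons.**"  -> "shadedeps.commons.@1").inAll,
>>>>>>>     ShadeRule.rename("org.antlr.**"           -> "shadedeps.antlr.@1").inAll
>>>>>>>   )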
>>>>>>> 
>>>>>>> There is an expected one-time cost in onboarding our Spark users to the 
>>>>>>> shaded client. Most issues require importing missing dependencies 
>>>>>>> originally provided by Spark/Hadoop. We are generally in favor of 
>>>>>>> shading more of Spark’s dependencies because it has helped reduce 
>>>>>>> developer toil and troubleshooting efforts.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Regina
>>>>>>> 
>>>>>>> On 2024/12/07 15:30:20 Mich Talebzadeh wrote:
>>>>>>> > General comment without specifics. I think shading should be used *on 
>>>>>>> > a case-by-case basis*, when the benefits outweigh the drawbacks. How 
>>>>>>> > about exploring alternatives such as modularization, dependency 
>>>>>>> > management, or careful dependency selection before resorting to 
>>>>>>> > shading? My point is that shading will introduce more debugging and 
>>>>>>> > testing work, as packages will be renamed, impacting flexibility. Case 
>>>>>>> > in point: unit and integration tests may need adjustments to account 
>>>>>>> > for the renamed packages.
>>>>>>> >
>>>>>>> > HTH
>>>>>>> >
>>>>>>> > Mich Talebzadeh,
>>>>>>> >
>>>>>>> > Architect | Data Science | Financial Crime | GDPR & Compliance 
>>>>>>> > Specialist
>>>>>>> > PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy>, Imperial 
>>>>>>> > College London <https://en.wikipedia.org/wiki/Imperial_College_London>
>>>>>>> > London, United Kingdom
>>>>>>> >
>>>>>>> >
>>>>>>> >    view my Linkedin profile 
>>>>>>> > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>> >
>>>>>>> >
>>>>>>> >  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > *Disclaimer:* The information provided is correct to the best of my 
>>>>>>> > knowledge but of course cannot be guaranteed. It is essential to note 
>>>>>>> > that, as with any advice, quote "one test result is worth one-thousand 
>>>>>>> > expert opinions" (Wernher von Braun 
>>>>>>> > <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>>>> >
>>>>>>> >
>>>>>>> > On Sat, 7 Dec 2024 at 06:21, Holden Karau <ho...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > > Hi Y'all,
>>>>>>> > >
>>>>>>> > > As we're getting closer to 4.0 I was thinking now is a good time 
>>>>>>> > > for us to
>>>>>>> > > try and reduce the class path we expose for JVM users. Are there 
>>>>>>> > > any common
>>>>>>> > > classes/packages folks would like to see shaded?
>>>>>>> > >
>>>>>>> > > Cheers,
>>>>>>> > >
>>>>>>> > > Holden :)
>>>>>>> > >
>>>>>>> > > --
>>>>>>> > > Twitter: https://twitter.com/holdenkarau
>>>>>>> > > Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>>>> > > Books (Learning Spark, High Performance Spark, etc.):
>>>>>>> > > https://amzn.to/2MaRAG9
>>>>>>> > > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>> > > Pronouns: she/her
>>>>>>> > >
>>>>>>> >
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> John Zhuge
>>> 
>>> 
>>> 
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> Pronouns: she/her
>> 
