Re: Re: Increasing Shading & Relocating for 4.0

Holden Karau Sat, 18 Jan 2025 13:53:47 -0800

I would say the short answer is "mostly not." The longer answer is that
the Connect APIs explicitly do not cover many of what we would call
"paved paths," and we're more likely to have JAR conflicts with advanced
users, who are more likely to use some of the non-supported APIs.
For example, some of our biggest JAR conflicts come from other platform
teams which build platforms on top of Spark (think custom machine
learning tools or special streaming stuff).


It's sort of that classic problem of building something for the 90% when the
10% are the ones with the actual issue you're trying to avoid.

On Sat, Jan 18, 2025 at 1:26 PM Denny Lee <[email protected]> wrote:

> BTW, one of many reasons Spark Connect was developed was to potentially
> simplify this process around shading (i.e. not need to do it).   I’m
> wondering if utilizing Spark Connect could be a potential solution here?
>
>
> On Fri, Jan 17, 2025 at 12:27 Holden Karau <[email protected]> wrote:
>
>> +1 I think this is great. If you’ve got any shading you’d be open to
>> upstreaming I’d be happy to review it.
>>
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> <https://www.fighthealthinsurance.com/?q=hk_email>
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>>
>> On Fri, Jan 17, 2025 at 12:25 PM John Zhuge <[email protected]> wrote:
>>
>>> Thanks for sharing the insightful context!
>>>
>>> On Fri, Jan 17, 2025 at 11:47 AM Regina Lee <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I’d like to share insights from our Spark team at LinkedIn. We recently
>>>> moved to a mostly shaded Spark 3 client internally. Our goal was to
>>>> minimize dependency conflicts that could hinder Spark upgrades, especially
>>>> given our previous efforts to migrate our users from Spark 2 to Spark 3,
>>>> and LinkedIn’s heavy Scala / Java use cases with complicated dependency
>>>> trees. We shaded rather aggressively (100+ relocations) given our specific
>>>> ecosystem needs – Hadoop 2.10 with no current/planned support for Spark
>>>> streaming / connect modules.
>>>>
>>>> At a high level, some notable shaded prefixes included org.json,
>>>> com.google.common / protobuf, org.apache.commons, and org.antlr. Key
>>>> dependencies *not* shaded were avro, jackson, datanucleus, logging /
>>>> JRE / scala dependencies (in general, any dependencies exposed in Spark’s /
>>>> other dependencies’ public APIs).
>>>>
>>>> There is an expected one-time cost in onboarding our Spark users to the
>>>> shaded client. Most issues require importing missing dependencies
>>>> originally provided by Spark/Hadoop. We are generally in favor of shading
>>>> more of Spark’s dependencies because it has helped reduce developer toil
>>>> and troubleshooting efforts.
>>>>
>>>> Thanks,
>>>>
>>>> Regina
>>>>
>>>> On 2024/12/07 15:30:20 Mich Talebzadeh wrote:
>>>> > General comment without specifics. I think shading should be used *on a
>>>> > case by case basis* when the benefits outweigh the drawbacks. How about
>>>> > exploring alternatives such as modularization, dependency management, or
>>>> > careful dependency selection, before resorting to shading? My point is that
>>>> > shading will introduce more debugging and testing as packages will be
>>>> > renamed impacting flexibility. Case in point, things like unit and
>>>> > integration tests may need adjustments to account for the renamed packages.
>>>> >
>>>> > HTH
>>>> >
>>>> > Mich Talebzadeh,
>>>> >
>>>> > Architect | Data Science | Financial Crime | GDPR & Compliance Specialist
>>>> > PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
>>>> > London <https://en.wikipedia.org/wiki/Imperial_College_London>
>>>> > London, United Kingdom
>>>> >
>>>> >
>>>> >    view my Linkedin profile
>>>> > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>> >
>>>> >
>>>> >  https://en.everybodywiki.com/Mich_Talebzadeh
>>>> >
>>>> >
>>>> >
>>>> > *Disclaimer:* The information provided is correct to the best of my
>>>> > knowledge but of course cannot be guaranteed. It is essential to note
>>>> > that, as with any advice, quote "one test result is worth one-thousand
>>>> > expert opinions (Wernher von Braun
>>>> > <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>> >
>>>> >
>>>> > On Sat, 7 Dec 2024 at 06:21, Holden Karau <[email protected]> wrote:
>>>> >
>>>> > > Hi Y'all,
>>>> > >
>>>> > > As we're getting closer to 4.0 I was thinking now is a good time for us to
>>>> > > try and reduce the class path we expose for JVM users. Are there any common
>>>> > > classes/packages folks would like to see shaded?
>>>> > >
>>>> > > Cheers,
>>>> > >
>>>> > > Holden :)
>>>> > >
>>>> >
>>>>
>>>
>>>
>>> --
>>> John Zhuge
>>>
>>

