That's what I'm hoping for: that going forward we can have more non-JVM clients (Python, GoLang, Rust, etc.) and make things simpler for JVM-based clients. I appreciate your call-out on the 90%/10% split, Holden - completely fair. I would just love to see more traction on this so that we can minimize the need for shading, eh?!
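To make the non-JVM client story above concrete, here is a minimal sketch of a thin Python Spark Connect client. It is a sketch, not a tested recipe: it assumes a Spark Connect server is already running (e.g. started via `./sbin/start-connect-server.sh`), and `localhost:15002` is the documented default endpoint, used here as a placeholder.

```python
# Sketch: a thin Spark Connect client (PySpark 3.4+).
# The client talks gRPC to the server, so no Spark JARs live on the
# client classpath -- which is exactly why the shading / JAR-conflict
# problem discussed in this thread does not arise for Connect clients.
# Assumes a Spark Connect server is running at the placeholder endpoint.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
df = spark.range(100).selectExpr("id", "id % 10 AS bucket")
df.groupBy("bucket").count().show()
```

The same decoupling is what would let Go or Rust clients exist at all: they only need to speak the Connect gRPC protocol, not link against Spark's JVM internals.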
On Sat, Jan 18, 2025 at 3:03 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Yup, it will definitely take a while, but I’d love to start tracking down
> the things that prevent people from moving (the RDD API is one, but I’m
> worried there are also other internal hooks), and also start encouraging
> library and plugin developers to use more forward-compatible APIs.
> Hopefully we can get there. Having Spark Connect used by default in at
> least some languages, or recommended for use cases like this, could help
> with that.
>
> On Jan 18, 2025, at 2:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> I think your view highlights the need for a shift towards more stable and
> version-independent APIs. Spark Connect IMO is a key enabler of this
> shift, allowing users and developers to build applications and libraries
> that are more resilient to changes in Spark's internals than RDD-based
> code is.
>
> As I stated before, maintaining backward compatibility for existing
> RDD-based applications and libraries is crucial during this transition, so
> the timeframe is another factor for consideration.
>
> HTH,
>
> Mich
>
> On Sat, 18 Jan 2025 at 22:10, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> We definitely need to move the “advanced” users to stable APIs, such as
>> the Spark Connect plugin APIs, if we want Spark to have a good future.
>> The RDD API was the wrong abstraction in my opinion — hopefully I can say
>> that since I worked on it. It was too tightly bound to Java types and to
>> internal details. I’d love to see more suggestions to help us get away
>> from these low-level Java APIs toward things that can automatically be
>> ported forward to new Spark versions. Why do platform teams building a
>> new ML library have to use internal APIs, for example?
>> They should use the same query plan + UDF interface that most user code
>> uses; this would save them a ton of maintenance time going forward and
>> help their users benefit from the latest changes in Spark (or at least
>> test them) much sooner.
>>
>> I’m not sure whether it’s clear, but this was absolutely the #1 goal of
>> Spark Connect to me — it was to give Spark users and library developers
>> (data sources, algorithms, etc.) the freedom not to worry about Spark
>> version updates. Otherwise, all these projects will eventually be
>> replaced by platforms that don’t require you to worry about versions or
>> wait for your platform team to port a ton of jobs and libraries over for
>> each release. Even shading only gets you so far, and it introduces new
>> problems that other users might not like.
>>
>> On Jan 18, 2025, at 1:42 PM, Holden Karau <holden.ka...@gmail.com> wrote:
>>
>> I would say the short answer is "mostly not", and the longer answer is
>> that the Connect APIs explicitly do not cover many of what we would call
>> "paved paths." Meanwhile, we're more likely to have JAR conflicts with
>> advanced users, who are more likely to use some of the non-supported
>> APIs. For example, some of our biggest JAR conflicts come from other
>> platform teams that build platforms on top of Spark (think custom
>> machine learning tools or special streaming stuff).
>>
>> It's sort of the classic problem of building for the 90% when the 10%
>> are the ones with the actual issue you're trying to avoid.
>>
>> On Sat, Jan 18, 2025 at 1:26 PM Denny Lee <denny.g....@gmail.com> wrote:
>>
>>> BTW, one of the many reasons Spark Connect was developed was to
>>> potentially simplify this process around shading (i.e. remove the need
>>> to do it). I’m wondering if utilizing Spark Connect could be a
>>> potential solution here?
>>>
>>> On Fri, Jan 17, 2025 at 12:27 Holden Karau <holden.ka...@gmail.com> wrote:
>>>
>>>> +1 I think this is great.
>>>> If you’ve got any shading you’d be open to upstreaming, I’d be happy
>>>> to review it.
>>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> Pronouns: she/her
>>>>
>>>> On Fri, Jan 17, 2025 at 12:25 PM John Zhuge <jzh...@apache.org> wrote:
>>>>
>>>>> Thanks for sharing the insightful context!
>>>>>
>>>>> On Fri, Jan 17, 2025 at 11:47 AM Regina Lee <re...@linkedin.com.invalid> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I’d like to share insights from our Spark team at LinkedIn. We
>>>>>> recently moved to a mostly shaded Spark 3 client internally. Our
>>>>>> goal was to minimize dependency conflicts that could hinder Spark
>>>>>> upgrades, especially given our previous effort to migrate our users
>>>>>> from Spark 2 to Spark 3, and LinkedIn’s heavy Scala/Java use cases
>>>>>> with complicated dependency trees. We shaded rather aggressively
>>>>>> (100+ relocations) given our specific ecosystem needs: Hadoop 2.10,
>>>>>> with no current or planned support for the Spark Streaming / Connect
>>>>>> modules.
>>>>>>
>>>>>> At a high level, some notable shaded prefixes included org.json,
>>>>>> com.google.common / protobuf, org.apache.commons, and org.antlr. Key
>>>>>> dependencies *not* shaded were avro, jackson, datanucleus, and
>>>>>> logging / JRE / Scala dependencies (in general, any dependencies
>>>>>> exposed in Spark’s or other dependencies’ public APIs).
>>>>>>
>>>>>> There is an expected one-time cost in onboarding our Spark users to
>>>>>> the shaded client. Most issues require importing missing
>>>>>> dependencies originally provided by Spark/Hadoop.
>>>>>> We are generally in favor of shading more of Spark’s dependencies
>>>>>> because it has helped reduce developer toil and troubleshooting
>>>>>> effort.
>>>>>>
>>>>>> Thanks,
>>>>>> Regina
>>>>>>
>>>>>> On 2024/12/07 15:30:20 Mich Talebzadeh wrote:
>>>>>> > A general comment without specifics: I think shading should be
>>>>>> > used *on a case-by-case basis*, when the benefits outweigh the
>>>>>> > drawbacks. How about exploring alternatives such as
>>>>>> > modularization, dependency management, or careful dependency
>>>>>> > selection before resorting to shading? My point is that shading
>>>>>> > will introduce more debugging and testing, as packages will be
>>>>>> > renamed, impacting flexibility. Case in point: things like unit
>>>>>> > and integration tests may need adjustments to account for the
>>>>>> > renamed packages.
>>>>>> >
>>>>>> > HTH
>>>>>> >
>>>>>> > Mich Talebzadeh,
>>>>>> > Architect | Data Science | Financial Crime | GDPR & Compliance Specialist
>>>>>> > PhD, Imperial College London
>>>>>> > London, United Kingdom
>>>>>> >
>>>>>> > https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
>>>>>> > https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>> >
>>>>>> > *Disclaimer:* The information provided is correct to the best of
>>>>>> > my knowledge but of course cannot be guaranteed. It is essential
>>>>>> > to note that, as with any advice, "one test result is worth one
>>>>>> > thousand expert opinions" (Wernher von Braun).
>>>>>> >
>>>>>> > On Sat, 7 Dec 2024 at 06:21, Holden Karau <ho...@gmail.com> wrote:
>>>>>> >
>>>>>> > > Hi y'all,
>>>>>> > >
>>>>>> > > As we're getting closer to 4.0, I was thinking now is a good
>>>>>> > > time for us to try to reduce the class path we expose for JVM
>>>>>> > > users. Are there any common classes/packages folks would like
>>>>>> > > to see shaded?
>>>>>> > >
>>>>>> > > Cheers,
>>>>>> > >
>>>>>> > > Holden :)
>>>>>
>>>>> --
>>>>> John Zhuge
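To make the relocations discussed upthread concrete, here is a sketch of what such a configuration looks like with the Maven Shade Plugin. The relocated packages come from Regina's list; the `org.sparkproject.` shaded prefix follows the convention Spark itself uses for its shaded classes. The fragment is illustrative only, not LinkedIn's actual build.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- Conflict-prone dependencies get rewritten to a private prefix,
           so user JARs can bring their own versions without clashing. -->
      <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>org.sparkproject.com.google.common</shadedPattern>
      </relocation>
      <relocation>
        <pattern>org.apache.commons</pattern>
        <shadedPattern>org.sparkproject.org.apache.commons</shadedPattern>
      </relocation>
      <relocation>
        <pattern>org.antlr</pattern>
        <shadedPattern>org.sparkproject.org.antlr</shadedPattern>
      </relocation>
      <!-- Deliberately NOT relocated: avro, jackson, datanucleus, and
           logging / JRE / Scala dependencies, because they appear in
           public API signatures, per the note upthread. -->
    </relocations>
  </configuration>
</plugin>
```

This also illustrates Mich's caution: any user code or test that referenced `com.google.common` classes obtained from Spark's classpath must switch to the relocated name or declare its own dependency, which is the one-time onboarding cost Regina describes.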