That's what I'm hoping for: that going forward we can have more non-JVM clients (Python, GoLang, Rust, etc.) and make things simpler for JVM-based clients. I appreciate your call-out on the 90%/10% split, Holden - completely fair. I would just love to see more traction on this so that we can minimize the need for shading, eh?!
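To make the non-JVM client story above concrete, here is a minimal sketch of a thin Python Spark Connect client. It is a sketch, not a tested recipe: it assumes a Spark Connect server is already running (e.g. started via `./sbin/start-connect-server.sh`), and `localhost:15002` is the documented default endpoint, used here as a placeholder.

```python
# Sketch: a thin Spark Connect client (PySpark 3.4+).
# The client talks gRPC to the server, so no Spark JARs live on the
# client classpath -- which is exactly why the shading / JAR-conflict
# problem discussed in this thread does not arise for Connect clients.
# Assumes a Spark Connect server is running at the placeholder endpoint.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
df = spark.range(100).selectExpr("id", "id % 10 AS bucket")
df.groupBy("bucket").count().show()
```

The same decoupling is what would let Go or Rust clients exist at all: they only need to speak the Connect gRPC protocol, not link against Spark's JVM internals.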
On Sat, Jan 18, 2025 at 3:03 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Yup, it will definitely take a while, but I’d love to start tracking down
> the things that prevent people from moving (the RDD API is one, but I’m
> worried there are also other internal hooks), and also start encouraging
> library and plugin developers to use more forward-compatible APIs.
> Hopefully we can get there. Having Spark Connect used by default in at
> least some languages, or recommended for use cases like this, could help
> with that.
>
> On Jan 18, 2025, at 2:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> I think your view highlights the need for a shift towards more stable and
> version-independent APIs. Spark Connect IMO is a key enabler of this
> shift, allowing users and developers to build applications and libraries
> that are more resilient to changes in Spark's internals than RDD-based
> code is.
>
> As I stated before, maintaining backward compatibility for existing
> RDD-based applications and libraries is crucial during this transition, so
> the timeframe is another factor for consideration.
>
> HTH,
>
> Mich
>
> On Sat, 18 Jan 2025 at 22:10, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> We definitely need to move the “advanced” users to stable APIs, such as
>> the Spark Connect plugin APIs, if we want Spark to have a good future.
>> The RDD API was the wrong abstraction in my opinion — hopefully I can say
>> that since I worked on it. It was too tightly bound to Java types and to
>> internal details. I’d love to see more suggestions to help us get away
>> from these low-level Java APIs toward things that can automatically be
>> ported forward to new Spark versions. Why do platform teams building a
>> new ML library have to use internal APIs, for example?
>> They should use the same query plan + UDF interface that most user code
>> uses; this would save them a ton of maintenance time going forward and
>> help their users benefit from the latest changes in Spark (or at least
>> test them) much sooner.
>>
>> I’m not sure whether it’s clear, but this was absolutely the #1 goal of
>> Spark Connect to me — it was to give Spark users and library developers
>> (data sources, algorithms, etc.) the freedom not to worry about Spark
>> version updates. Otherwise, all these projects will eventually be
>> replaced by platforms that don’t require you to worry about versions or
>> wait for your platform team to port a ton of jobs and libraries over for
>> each release. Even shading only gets you so far, and it introduces new
>> problems that other users might not like.
>>
>> On Jan 18, 2025, at 1:42 PM, Holden Karau <holden.ka...@gmail.com> wrote:
>>
>> I would say the short answer is "mostly not", and the longer answer is
>> that the Connect APIs explicitly do not cover many of what we would call
>> "paved paths." Meanwhile, we're more likely to have JAR conflicts with
>> advanced users, who are more likely to use some of the non-supported
>> APIs. For example, some of our biggest JAR conflicts come from other
>> platform teams that build platforms on top of Spark (think custom
>> machine learning tools or special streaming stuff).
>>
>> It's sort of the classic problem of building for the 90% when the 10%
>> are the ones with the actual issue you're trying to avoid.
>>
>> On Sat, Jan 18, 2025 at 1:26 PM Denny Lee <denny.g....@gmail.com> wrote:
>>
>>> BTW, one of the many reasons Spark Connect was developed was to
>>> potentially simplify this process around shading (i.e. remove the need
>>> to do it). I’m wondering if utilizing Spark Connect could be a
>>> potential solution here?
>>>
>>> On Fri, Jan 17, 2025 at 12:27 Holden Karau <holden.ka...@gmail.com> wrote:
>>>
>>>> +1 I think this is great.
>>>> If you’ve got any shading you’d be open to upstreaming, I’d be happy
>>>> to review it.
>>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> Pronouns: she/her
>>>>
>>>> On Fri, Jan 17, 2025 at 12:25 PM John Zhuge <jzh...@apache.org> wrote:
>>>>
>>>>> Thanks for sharing the insightful context!
>>>>>
>>>>> On Fri, Jan 17, 2025 at 11:47 AM Regina Lee <re...@linkedin.com.invalid> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I’d like to share insights from our Spark team at LinkedIn. We
>>>>>> recently moved to a mostly shaded Spark 3 client internally. Our
>>>>>> goal was to minimize dependency conflicts that could hinder Spark
>>>>>> upgrades, especially given our previous effort to migrate our users
>>>>>> from Spark 2 to Spark 3, and LinkedIn’s heavy Scala/Java use cases
>>>>>> with complicated dependency trees. We shaded rather aggressively
>>>>>> (100+ relocations) given our specific ecosystem needs: Hadoop 2.10,
>>>>>> with no current or planned support for the Spark Streaming / Connect
>>>>>> modules.
>>>>>>
>>>>>> At a high level, some notable shaded prefixes included org.json,
>>>>>> com.google.common / protobuf, org.apache.commons, and org.antlr. Key
>>>>>> dependencies *not* shaded were avro, jackson, datanucleus, and
>>>>>> logging / JRE / Scala dependencies (in general, any dependencies
>>>>>> exposed in Spark’s or other dependencies’ public APIs).
>>>>>>
>>>>>> There is an expected one-time cost in onboarding our Spark users to
>>>>>> the shaded client. Most issues require importing missing
>>>>>> dependencies originally provided by Spark/Hadoop.
>>>>>> We are generally in favor of shading more of Spark’s dependencies
>>>>>> because it has helped reduce developer toil and troubleshooting
>>>>>> effort.
>>>>>>
>>>>>> Thanks,
>>>>>> Regina
>>>>>>
>>>>>> On 2024/12/07 15:30:20 Mich Talebzadeh wrote:
>>>>>> > A general comment without specifics: I think shading should be
>>>>>> > used *on a case-by-case basis*, when the benefits outweigh the
>>>>>> > drawbacks. How about exploring alternatives such as
>>>>>> > modularization, dependency management, or careful dependency
>>>>>> > selection before resorting to shading? My point is that shading
>>>>>> > will introduce more debugging and testing, as packages will be
>>>>>> > renamed, impacting flexibility. Case in point: things like unit
>>>>>> > and integration tests may need adjustments to account for the
>>>>>> > renamed packages.
>>>>>> >
>>>>>> > HTH
>>>>>> >
>>>>>> > Mich Talebzadeh,
>>>>>> > Architect | Data Science | Financial Crime | GDPR & Compliance Specialist
>>>>>> > PhD, Imperial College London
>>>>>> > London, United Kingdom
>>>>>> >
>>>>>> > https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
>>>>>> > https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>> >
>>>>>> > *Disclaimer:* The information provided is correct to the best of
>>>>>> > my knowledge but of course cannot be guaranteed. It is essential
>>>>>> > to note that, as with any advice, "one test result is worth one
>>>>>> > thousand expert opinions" (Wernher von Braun).
>>>>>> >
>>>>>> > On Sat, 7 Dec 2024 at 06:21, Holden Karau <ho...@gmail.com> wrote:
>>>>>> >
>>>>>> > > Hi y'all,
>>>>>> > >
>>>>>> > > As we're getting closer to 4.0, I was thinking now is a good
>>>>>> > > time for us to try to reduce the class path we expose for JVM
>>>>>> > > users. Are there any common classes/packages folks would like
>>>>>> > > to see shaded?
>>>>>> > >
>>>>>> > > Cheers,
>>>>>> > >
>>>>>> > > Holden :)
>>>>>
>>>>> --
>>>>> John Zhuge
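To make the relocations discussed upthread concrete, here is a sketch of what such a configuration looks like with the Maven Shade Plugin. The relocated packages come from Regina's list; the `org.sparkproject.` shaded prefix follows the convention Spark itself uses for its shaded classes. The fragment is illustrative only, not LinkedIn's actual build.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- Conflict-prone dependencies get rewritten to a private prefix,
           so user JARs can bring their own versions without clashing. -->
      <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>org.sparkproject.com.google.common</shadedPattern>
      </relocation>
      <relocation>
        <pattern>org.apache.commons</pattern>
        <shadedPattern>org.sparkproject.org.apache.commons</shadedPattern>
      </relocation>
      <relocation>
        <pattern>org.antlr</pattern>
        <shadedPattern>org.sparkproject.org.antlr</shadedPattern>
      </relocation>
      <!-- Deliberately NOT relocated: avro, jackson, datanucleus, and
           logging / JRE / Scala dependencies, because they appear in
           public API signatures, per the note upthread. -->
    </relocations>
  </configuration>
</plugin>
```

This also illustrates Mich's caution: any user code or test that referenced `com.google.common` classes obtained from Spark's classpath must switch to the relocated name or declare its own dependency, which is the one-time onboarding cost Regina describes.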