On 2025/01/18 22:35:59 Mich Talebzadeh wrote:
> I think your view highlights the need for a shift towards more stable and
> version-independent APIs. Spark Connect IMO is a key enabler of this shift,
> allowing users and developers to build applications and libraries that are
> more resilient to changes in Spark's internals as opposed to RDDs.
>
Yup. The vital phrase here is “stable and version-independent APIs”.

> As I stated before, maintaining backward compatibility for the existing
> RDD-based applications and libraries is crucial during this transition, so
> the timeframe is another factor for consideration.
>
This is vital for the long-term transition, as is the 90/10 issue that
Holden rightly points out for that 10%.

Cheers
Jules

>
> Mich
>
>
> On Sat, 18 Jan 2025 at 22:10, Matei Zaharia <ma...@gmail.com> wrote:
>
> > We definitely need to move the “advanced” users to stable APIs, such as
> > the Spark Connect plugin APIs, if we want Spark to have a good future.
> > The RDD API was the wrong abstraction in my opinion (hopefully I can say
> > that since I worked on it). It was too tightly bound to Java types and to
> > internal details. I’d love to see more suggestions to help us get away
> > from these low-level Java APIs toward things that can automatically be
> > ported forward with new Spark versions. Why, for example, do platform
> > teams building a new ML library have to use internal APIs? They should
> > use the same query plan + UDF interface that most user code uses, and
> > this would save them a ton of maintenance time going forward and help
> > their users benefit from the latest changes in Spark (or at least test
> > them) much sooner.
> >
> > I’m not sure whether it’s clear, but this was absolutely the #1 goal of
> > Spark Connect to me: it was to give Spark users and library developers
> > (data sources, algorithms, etc.) the freedom not to worry about Spark
> > version updates. Otherwise, all these projects will eventually be replaced
> > by platforms that don’t require you to worry about versions or wait for
> > your platform team to port a ton of jobs and libraries over for each
> > release. Even shading only gets you so far, and it introduces new problems
> > that other users might not like.
> >
> > On Jan 18, 2025, at 1:42 PM, Holden Karau <ho...@gmail.com> wrote:
> >
> > I would say the short answer is "mostly not" and the longer answer is that
> > the connect APIs are explicitly not covering many, what we would call,
> > "paved paths." Because we're more likely to have JAR conflicts with
> > advanced users who are more likely to use some of the non-supported APIs.
> > For example, some of our biggest JAR conflicts come from other platform
> > teams which build platforms on top of Spark (thinking custom machine
> > learning tools or special streaming stuff).
> >
> > It's sort of that classic problem of building something for the 90% but
> > the 10% are the ones with the actual issue you're trying to avoid.
> >
> > On Sat, Jan 18, 2025 at 1:26 PM Denny Lee <de...@gmail.com> wrote:
> >
> >> BTW, one of many reasons Spark Connect was developed was to potentially
> >> simplify this process around shading (i.e. not need to do it). I’m
> >> wondering if utilizing Spark Connect could be a potential solution here?
> >>
> >>
> >> On Fri, Jan 17, 2025 at 12:27 Holden Karau <ho...@gmail.com> wrote:
> >>
> >>> +1 I think this is great. If you’ve got any shading you’d be open to
> >>> upstreaming I’d be happy to review it.
> >>>
> >>> Twitter: https://twitter.com/holdenkarau
> >>> Fight Health Insurance: https://www.fighthealthinsurance.com/
> >>> <https://www.fighthealthinsurance.com/?q=hk_email>
> >>> Books (Learning Spark, High Performance Spark, etc.):
> >>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >>> Pronouns: she/her
> >>>
> >>>
> >>> On Fri, Jan 17, 2025 at 12:25 PM John Zhuge <jz...@apache.org> wrote:
> >>>
> >>>> Thanks for sharing the insightful context!
> >>>>
> >>>> On Fri, Jan 17, 2025 at 11:47 AM Regina Lee <re...@linkedin.com.invalid>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I’d like to share insights from our Spark team at LinkedIn.
> >>>>> We recently moved to a mostly shaded Spark 3 client internally. Our
> >>>>> goal was to minimize dependency conflicts that could hinder Spark
> >>>>> upgrades, especially given our previous efforts to migrate our users
> >>>>> from Spark 2 to Spark 3, and LinkedIn’s heavy Scala / Java use cases
> >>>>> with complicated dependency trees. We shaded rather aggressively
> >>>>> (100+ relocations) given our specific ecosystem needs – Hadoop 2.10
> >>>>> with no current/planned support for Spark streaming / connect modules.
> >>>>>
> >>>>> At a high level, some notable shaded prefixes included org.json,
> >>>>> com.google.common / protobuf, org.apache.commons, and org.antlr. Key
> >>>>> dependencies *not* shaded were avro, jackson, datanucleus, logging /
> >>>>> JRE / scala dependencies (in general, any dependencies exposed in
> >>>>> Spark’s / other dependencies’ public APIs).
> >>>>>
> >>>>> There is an expected one-time cost in onboarding our Spark users to
> >>>>> the shaded client. Most issues require importing missing dependencies
> >>>>> originally provided by Spark/Hadoop. We are generally in favor of
> >>>>> shading more of Spark’s dependencies because it has helped reduce
> >>>>> developer toil and troubleshooting efforts.
> >>>>>
> >>>>> Thanks,
> >>>>> Regina
> >>>>>
> >>>>> On 2024/12/07 15:30:20 Mich Talebzadeh wrote:
> >>>>> > General comment without specifics. I think shading should be used
> >>>>> > *on a case-by-case basis* when the benefits outweigh the drawbacks.
> >>>>> > How about exploring alternatives such as modularization, dependency
> >>>>> > management, or careful dependency selection, before resorting to
> >>>>> > shading? My point is that shading will introduce more debugging and
> >>>>> > testing as packages will be renamed, impacting flexibility.
> >>>>> > Case in point, things like unit and integration tests may need
> >>>>> > adjustments to account for the renamed packages.
> >>>>> >
> >>>>> > HTH
> >>>>> >
> >>>>> > Mich Talebzadeh,
> >>>>> >
> >>>>> > Architect | Data Science | Financial Crime | GDPR & Compliance
> >>>>> > Specialist
> >>>>> > PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial
> >>>>> > College London <https://en.wikipedia.org/wiki/Imperial_College_London>
> >>>>> > London, United Kingdom
> >>>>> >
> >>>>> >
> >>>>> > view my Linkedin profile
> >>>>> > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> >>>>> >
> >>>>> >
> >>>>> > https://en.everybodywiki.com/Mich_Talebzadeh
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > *Disclaimer:* The information provided is correct to the best of my
> >>>>> > knowledge but of course cannot be guaranteed. It is essential to
> >>>>> > note that, as with any advice, quote "one test result is worth
> >>>>> > one-thousand expert opinions (Wernher von Braun
> >>>>> > <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
> >>>>> >
> >>>>> >
> >>>>> > On Sat, 7 Dec 2024 at 06:21, Holden Karau <ho...@gmail.com> wrote:
> >>>>> >
> >>>>> > > Hi Y'all,
> >>>>> > >
> >>>>> > > As we're getting closer to 4.0 I was thinking now is a good time
> >>>>> > > for us to try and reduce the class path we expose for JVM users.
> >>>>> > > Are there any common classes/packages folks would like to see
> >>>>> > > shaded?
> >>>>> > >
> >>>>> > > Cheers,
> >>>>> > >
> >>>>> > > Holden :)
> >>>>> > >
> >>>>> >
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> John Zhuge
> >>>>
> >>>
> >