But... is it not like that in any other Java/Scala/Python/... app that uses dependencies that also have their own dependencies?
If you want to provide a library, maybe you should give the user the option to decide whether they want an all-in-one uber jar with shaded (more difficult to debug) dependencies included, or a lighter jar with only your code. I personally prefer the second option; maybe it's less user-friendly, but it's definitely more developer-savvy, because you're fully aware of the versions you're executing and can upgrade versions (where compatible) if bugs or vulnerabilities are detected, for example.

On Fri, Jun 6, 2025, 10:09, Sem <ssinche...@apache.org> wrote:

> > I may not need anything from Spark, but if I declare a dependency on Jackson or Guava with a different version than Spark already uses and packages, I might break things...
>
> In that case I would recommend using assembly / assemblyShadeRules for sbt-assembly, or maven-shade-plugin for Maven, and shading dependencies like Jackson or Guava to avoid conflicts with Spark when you pack everything into the uber jar.
>
> On Wed, 2025-06-04 at 11:52 +0300, Nimrod Ofek wrote:
>
> Yes, that was my point.
> Whether I'm directly using something or not, it is really there, so it would be beneficial for me to have a way of knowing what my exact dependencies are, even if I don't use them directly in case a or b, because they are there.
>
> For instance, if I am creating a library for Delta that helps track the lag of Structured Streaming Delta-to-Delta table streams, I may not need anything from Spark, but if I declare a dependency on Jackson or Guava with a different version than Spark already uses and packages, I might break things... Because I'll add Jackson or Guava to my uber jar, and that will cause issues with the jars deployed out of the box...
>
> On Wed, Jun 4, 2025, 01:38, Sean Owen <sro...@gmail.com> wrote:
>
> Yes, you're just saying that if your app depends on Foo, and Spark depends on Foo, then ideally you depend on the exact same version Spark uses. Otherwise it's up to Maven/SBT to pick one or the other version, which might or might not be suitable. Yes, dependency conflicts are painful to deal with and a real thing everywhere, and this gets into discussions like "why isn't everything shaded?", but that's not the point here, I think.
>
> But if your app depends on Foo, then Foo is in your POM regardless of what Spark does. It gets painful to figure out whether that conflicts with Spark's dependencies, sure, but you can figure it out with dependency:tree or similar. I also don't think adding a POM-only module changes any of that? You still have the same problem even if there is a spark-uber package depending on every module.
>
> Knowing which submodule is of interest - that does take some work. It's hopefully in the docs, and most apps just need spark-sql, but I can see this as an issue.
>
> I could see an argument for declaring a single POM-only artifact that depends on all Spark modules. Then you depend on that as 'provided' and you have all of Spark in compile scope only. (This is almost what spark-parent does, but I don't think it works that way.) It feels inaccurate, and not helpful for most use cases, but I don't see a major problem with it actually. Your dependency graph gets a lot bigger with stuff you don't need, but it's all in provided scope anyway.
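As a side note on the shading approach mentioned above: a minimal sketch of what it can look like with sbt-assembly, assuming the plugin is added to the build (the plugin version, shade prefix and package patterns below are only illustrative placeholders):

    // project/plugins.sbt -- plugin version is only an example, use a current release
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

    // build.sbt -- rename the Guava and Jackson packages inside the uber jar
    // so they cannot clash with the copies Spark itself ships
    assembly / assemblyShadeRules := Seq(
      ShadeRule.rename("com.google.common.**"     -> "my.shaded.guava.@1").inAll,
      ShadeRule.rename("com.fasterxml.jackson.**" -> "my.shaded.jackson.@1").inAll
    )

The trade-off is the one described above: shaded classes are harder to debug, but the uber jar no longer competes with Spark's own Jackson/Guava at runtime.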
> On Tue, Jun 3, 2025 at 5:23 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
> You don't add dependencies you don't use, but you do need to declare dependencies you do use, and if the platform you are running on uses a specific version you need to use that version - you can't break compatibility.
> Since Spark uses a lot of dependencies, I don't expect the user to check whether Spark uses, for instance, Jackson, and which version.
> I also didn't expect the ordinary user to know whether Spark Structured Streaming uses spark-sql or not when they need both - especially when they are already packaged together in the Spark server.
>
> Having said that, I guess they will just try adding packages, and if something doesn't compile they will use Coursier to fix the dependencies...
>
> Thanks anyway!
>
> On Tue, Jun 3, 2025, 22:09, Sean Owen <sro...@gmail.com> wrote:
>
> Do you have an example of what you mean?
>
> Yes, a deployment of Spark has all the modules. You do not need to (should not, in fact) deploy Spark code with your Spark app for this reason.
> You still need to express dependencies on the Spark code that your app uses at *compile* time, however, in order to compile - or else how can it compile?
> You do not add dependencies that you do not directly use, no.
> This is like any other multi-module project in the Maven/SBT ecosystem.
>
> On Tue, Jun 3, 2025 at 1:59 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
> It does not compile if I don't add spark-sql.
> In usual projects I'd agree with you, but since Spark comes complete with all dependencies - unlike other programs where you deploy only certain dependencies - I see no reason for users to select up front specific dependencies that are already bundled in the Spark server.
>
> On Tue, Jun 3, 2025, 21:44, Sean Owen <sro...@gmail.com> wrote:
>
> I think Spark, like any project, is large enough to decompose into modules, and it has been. A single app almost surely doesn't need all the modules. So yes, you have to depend on the modules you actually need, and I think that's normal. See Jackson for example.
> (spark-sql is not necessary, as it's required by the modules you depend on already.)
>
> What's the name for this new convenience package? spark-avro-sql-kafka? That seems too specific. And what about the 100 other variations that other apps need?
> For example, some apps will not need spark-sql-kafka but will need spark-streaming-kafka.
>
> You do not have to depend on exactly the same versions of dependencies that Spark does, although that's the safest thing to do. For example, unless you use Avro directly and its version matters to you, you do not declare this in your POM. If you do, that's fine; Maven/SBT decides on what version to use based on what you say and what Spark says. And this could be wrong, but, that's life in the world of dependencies. Much of the time, it works.
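On the dependency:tree point raised above: a small sketch of the sbt-side equivalent, assuming sbt 1.4 or later (Maven users get the same view from mvn dependency:tree); the coordinates in the comments are only examples:

    // project/plugins.sbt -- enables the dependency-tree tasks bundled with sbt 1.4+
    addDependencyTreePlugin

    // Then, from the sbt shell:
    //   dependencyTree
    //     prints the full transitive tree of your app's dependencies
    //   whatDependsOn com.fasterxml.jackson.core jackson-databind
    //     shows which of your dependencies pull in a given artifact

This makes it reasonably quick to spot where your declared versions diverge from what Spark's modules bring in.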
> On Tue, Jun 3, 2025 at 1:35 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
> I'll give an example:
> If I have a project that reads Avro messages from a Kafka topic and writes them to Delta tables, I would expect to set only:
>
>     libraryDependencies ++= Seq(
>       "io.delta" %% "delta-spark" % deltaVersion % Provided,
>       "org.apache.spark" %% "spark-avro" % sparkVersion,
>       "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
>       "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
>       "za.co.absa" %% "abris" % "6.4.0",
>       "org.apache.avro" % "avro" % apacheAvro,
>       "io.confluent" % "kafka-schema-registry-client" % "7.5.1",
>       "com.github.pureconfig" %% "pureconfig" % "0.17.5"
>     )
>
> and not to also have to add:
>
>     "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
>
> And to be honest, I don't think users really need to understand the internal structure to know which jar they need to add to use each feature...
> I don't think they need to know which project they need to depend on, as long as it's already provided... They just need to configure Spark as provided :)
>
> Thanks,
> Nimrod
>
> On Tue, Jun 3, 2025 at 8:57 PM Sean Owen <sro...@gmail.com> wrote:
>
> For sure, but that is what Maven/SBT do. They resolve your project dependencies, looking at all their transitive dependencies, according to some rules.
> You do not need to re-declare Spark's dependencies in your project, no.
> I'm not quite sure what you mean.
>
> On Tue, Jun 3, 2025 at 12:55 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
> Thanks, Sean.
> There are other dependencies that you need to align with Spark if you need to use them as well - like Guava, Jackson, etc.
> I find them more difficult to use, because you need to go to the Spark repo to check the correct version used - and if there are upgrades between Spark versions, you need to check that in order to upgrade as well.
> What do you think?
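On the last question about aligning Guava and Jackson with Spark: one option, if those libraries are used directly, is to pin them in the build to whatever the target Spark release ships. A minimal sketch; the version numbers below are placeholders and must be looked up in the pom.xml of the exact Spark version you deploy against:

    // build.sbt -- force resolution to the versions the Spark runtime provides.
    // Placeholder versions: check your Spark release's pom.xml before copying.
    val sparkJacksonVersion = "2.15.2"   // placeholder
    val sparkGuavaVersion   = "14.0.1"   // placeholder

    dependencyOverrides ++= Seq(
      "com.fasterxml.jackson.core"   % "jackson-databind"      % sparkJacksonVersion,
      "com.fasterxml.jackson.module" %% "jackson-module-scala" % sparkJacksonVersion,
      "com.google.guava"             % "guava"                 % sparkGuavaVersion
    )

dependencyOverrides only steers version conflict resolution; the alternative, as discussed earlier in the thread, is to shade these libraries into the uber jar and side-step the alignment entirely.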