But... is it not like that in any other Java/Scala/Python/... app that uses dependencies that also have their own dependencies?
If you want to provide a library, maybe you should give the user the option to decide whether they want an all-in-one uber jar with shaded (more difficult to debug) dependencies included, or a lighter jar with only your code. I personally prefer the second option; maybe it's less user-friendly, but it's definitely more developer-savvy, because you're fully aware of the versions you're executing and can upgrade versions (where compatible) if bugs or vulnerabilities are detected, for example.

On Fri, Jun 6, 2025, 10:09, Sem <ssinche...@apache.org> wrote:

> > I may not need anything from Spark, but if I declare a dependency on Jackson or Guava with a different version than Spark already uses and packages, I might break things...
>
> In that case I would recommend using assembly / assemblyShadeRules for sbt-assembly, or maven-shade-plugin for Maven, and shading dependencies like Jackson or Guava to avoid conflicts with Spark when you pack everything into the uber jar.
>
> On Wed, 2025-06-04 at 11:52 +0300, Nimrod Ofek wrote:
>
> Yes, that was my point.
> Whether I'm directly using something or not, it is really there, so it would be beneficial for me to have a way of knowing what my exact dependencies are, even if I don't use them directly in case a or b, because they are there.
>
> For instance, if I am creating a library for Delta that helps track the lag of Structured Streaming Delta-to-Delta table streams, I may not need anything from Spark, but if I declare a dependency on Jackson or Guava with a different version than Spark already uses and packages, I might break things... Because I'll add Jackson or Guava to my uber jar, and that will cause issues with the jars deployed out of the box...
>
> On Wed, Jun 4, 2025, 01:38, Sean Owen <sro...@gmail.com> wrote:
>
> Yes, you're just saying that if your app depends on Foo, and Spark depends on Foo, then ideally you depend on the exact same version Spark uses. Otherwise it's up to Maven/SBT to pick one or the other version, which might or might not be suitable. Yes, dependency conflicts are painful to deal with and a real thing everywhere, and this gets into discussions like "why isn't everything shaded?", but that's not the point here, I think.
>
> But if your app depends on Foo, then Foo is in your POM regardless of what Spark does. It gets painful to figure out whether that conflicts with Spark's dependencies, sure, but you can figure it out with dependency:tree or similar. I also don't think adding a POM-only module changes any of that? You still have the same problem even if there is a spark-uber package depending on every module.
>
> Knowing which submodule is of interest - that does take some work. It's hopefully in the docs, and most apps just need spark-sql, but I can see this as an issue.
>
> I could see an argument for declaring a single POM-only artifact that depends on all Spark modules. Then you depend on that as 'provided' and you have all of Spark in compile scope only. (This is almost what spark-parent does, but I don't think it works that way.) It feels inaccurate, and not helpful for most use cases, but I don't see a major problem with it actually. Your dependency graph gets a lot bigger with stuff you don't need, but it's all in provided scope anyway.
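As a side note on the shading approach mentioned above: a minimal sketch of what it can look like with sbt-assembly, assuming the plugin is added to the build (the plugin version, shade prefix and package patterns below are only illustrative placeholders):

    // project/plugins.sbt -- plugin version is only an example, use a current release
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

    // build.sbt -- rename the Guava and Jackson packages inside the uber jar
    // so they cannot clash with the copies Spark itself ships
    assembly / assemblyShadeRules := Seq(
      ShadeRule.rename("com.google.common.**"     -> "my.shaded.guava.@1").inAll,
      ShadeRule.rename("com.fasterxml.jackson.**" -> "my.shaded.jackson.@1").inAll
    )

The trade-off is the one described above: shaded classes are harder to debug, but the uber jar no longer competes with Spark's own Jackson/Guava at runtime.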
> On Tue, Jun 3, 2025 at 5:23 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
> You don't add dependencies you don't use, but you do need to declare dependencies you do use, and if the platform you are running on uses a specific version you need to use that version - you can't break compatibility.
> Since Spark uses a lot of dependencies, I don't expect the user to check whether Spark uses, for instance, Jackson, and which version.
> I also didn't expect the ordinary user to know whether Spark Structured Streaming uses spark-sql or not when they need both - especially when they are already packaged together in the Spark server.
>
> Having said that, I guess they will just try adding packages, and if something doesn't compile they will use Coursier to fix the dependencies...
>
> Thanks anyway!
>
> On Tue, Jun 3, 2025, 22:09, Sean Owen <sro...@gmail.com> wrote:
>
> Do you have an example of what you mean?
>
> Yes, a deployment of Spark has all the modules. You do not need to (should not, in fact) deploy Spark code with your Spark app for this reason.
> You still need to express dependencies on the Spark code that your app uses at *compile* time, however, in order to compile - or else how can it compile?
> You do not add dependencies that you do not directly use, no.
> This is like any other multi-module project in the Maven/SBT ecosystem.
>
> On Tue, Jun 3, 2025 at 1:59 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
> It does not compile if I don't add spark-sql.
> In usual projects I'd agree with you, but since Spark comes complete with all dependencies - unlike other programs where you deploy only certain dependencies - I see no reason for users to select up front specific dependencies that are already bundled in the Spark server.
>
> On Tue, Jun 3, 2025, 21:44, Sean Owen <sro...@gmail.com> wrote:
>
> I think Spark, like any project, is large enough to decompose into modules, and it has been. A single app almost surely doesn't need all the modules. So yes, you have to depend on the modules you actually need, and I think that's normal. See Jackson for example.
> (spark-sql is not necessary, as it's required by the modules you depend on already.)
>
> What's the name for this new convenience package? spark-avro-sql-kafka? That seems too specific. And what about the 100 other variations that other apps need?
> For example, some apps will not need spark-sql-kafka but will need spark-streaming-kafka.
>
> You do not have to depend on exactly the same versions of dependencies that Spark does, although that's the safest thing to do. For example, unless you use Avro directly and its version matters to you, you do not declare this in your POM. If you do, that's fine; Maven/SBT decides on what version to use based on what you say and what Spark says. And this could be wrong, but, that's life in the world of dependencies. Much of the time, it works.
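On the dependency:tree point raised above: a small sketch of the sbt-side equivalent, assuming sbt 1.4 or later (Maven users get the same view from mvn dependency:tree); the coordinates in the comments are only examples:

    // project/plugins.sbt -- enables the dependency-tree tasks bundled with sbt 1.4+
    addDependencyTreePlugin

    // Then, from the sbt shell:
    //   dependencyTree
    //     prints the full transitive tree of your app's dependencies
    //   whatDependsOn com.fasterxml.jackson.core jackson-databind
    //     shows which of your dependencies pull in a given artifact

This makes it reasonably quick to spot where your declared versions diverge from what Spark's modules bring in.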
> On Tue, Jun 3, 2025 at 1:35 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
> I'll give an example:
> If I have a project that reads Avro messages from a Kafka topic and writes them to Delta tables, I would expect to set only:
>
>     libraryDependencies ++= Seq(
>       "io.delta" %% "delta-spark" % deltaVersion % Provided,
>       "org.apache.spark" %% "spark-avro" % sparkVersion,
>       "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
>       "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
>       "za.co.absa" %% "abris" % "6.4.0",
>       "org.apache.avro" % "avro" % apacheAvro,
>       "io.confluent" % "kafka-schema-registry-client" % "7.5.1",
>       "com.github.pureconfig" %% "pureconfig" % "0.17.5"
>     )
>
> and not to also have to add:
>
>     "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
>
> And to be honest, I don't think users really need to understand the internal structure to know which jar they need to add to use each feature...
> I don't think they need to know which project they need to depend on, as long as it's already provided... They just need to configure Spark as provided :)
>
> Thanks,
> Nimrod
>
> On Tue, Jun 3, 2025 at 8:57 PM Sean Owen <sro...@gmail.com> wrote:
>
> For sure, but that is what Maven/SBT do. They resolve your project dependencies, looking at all their transitive dependencies, according to some rules.
> You do not need to re-declare Spark's dependencies in your project, no.
> I'm not quite sure what you mean.
>
> On Tue, Jun 3, 2025 at 12:55 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
> Thanks, Sean.
> There are other dependencies that you need to align with Spark if you need to use them as well - like Guava, Jackson, etc.
> I find them more difficult to use, because you need to go to the Spark repo to check the correct version used - and if there are upgrades between Spark versions, you need to check that in order to upgrade as well.
> What do you think?
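On the last question about aligning Guava and Jackson with Spark: one option, if those libraries are used directly, is to pin them in the build to whatever the target Spark release ships. A minimal sketch; the version numbers below are placeholders and must be looked up in the pom.xml of the exact Spark version you deploy against:

    // build.sbt -- force resolution to the versions the Spark runtime provides.
    // Placeholder versions: check your Spark release's pom.xml before copying.
    val sparkJacksonVersion = "2.15.2"   // placeholder
    val sparkGuavaVersion   = "14.0.1"   // placeholder

    dependencyOverrides ++= Seq(
      "com.fasterxml.jackson.core"   % "jackson-databind"      % sparkJacksonVersion,
      "com.fasterxml.jackson.module" %% "jackson-module-scala" % sparkJacksonVersion,
      "com.google.guava"             % "guava"                 % sparkGuavaVersion
    )

dependencyOverrides only steers version conflict resolution; the alternative, as discussed earlier in the thread, is to shade these libraries into the uber jar and side-step the alignment entirely.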