Yes, that was my point. Whether I'm directly using something or not, it is really there, so it would be beneficial for me to have a way of knowing exactly which dependencies I have, even the ones I don't use directly (in case a or b), because they are there.
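(A rough sketch of the kind of check I mean, assuming an sbt 1.4+ build, where a dependencyTree task is built in, or a Maven build for the second command; the Jackson group id is only an example filter:

sbt dependencyTree
mvn dependency:tree -Dincludes=com.fasterxml.jackson.core

That shows what my own build resolves to, but it still doesn't tell me which versions the deployed Spark runtime itself ships, which is the part I'd like to be able to see easily.)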
For instance, if I am creating a library for Delta that helps track the lag of Structured Streaming Delta-to-Delta table streams, I may not need anything from Spark directly, but if I declare a dependency on Jackson or Guava with a version different from the one Spark already uses and packages, I might break things... because I'll add Jackson or Guava to my uber jar, and that will clash with the jars deployed out of the box...

On Wed, Jun 4, 2025, 01:38, Sean Owen <sro...@gmail.com> wrote:

> Yes, you're just saying that if your app depends on Foo, and Spark depends
> on Foo, then ideally you depend on the exact same version Spark uses.
> Otherwise it's up to Maven/SBT to pick one or the other version, which
> might or might not be suitable. Yes, dependency conflicts are painful to
> deal with and a real thing everywhere, and this gets into discussions like,
> why isn't everything shaded? But that's not the point here, I think.
>
> But if your app depends on Foo, then Foo is in your POM regardless of what
> Spark does. It gets painful to figure out whether that conflicts with
> Spark's dependencies, sure, but you can figure it out with dependency:tree
> or similar. I also don't think adding a POM-only module changes any of
> that; you still have the same problem even if there is a spark-uber package
> depending on every module.
>
> Knowing which submodule is of interest - that does take some work. It's
> hopefully in the docs, and most apps just need spark-sql, but I can see
> this as an issue.
>
> I could see an argument for declaring a single POM-only artifact that
> depends on all Spark modules. Then you depend on that as 'provided' and you
> have all of Spark in compile scope only. (This is almost what spark-parent
> does, but I don't think it works that way.) It feels inaccurate, and not
> helpful for most use cases, but I don't see a major problem with it
> actually. Your dependency graph gets a lot bigger with stuff you don't
> need, but it's all in provided scope anyway.
>
> On Tue, Jun 3, 2025 at 5:23 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
>> You don't add dependencies you don't use - but you do need to declare
>> dependencies you do use, and if the platform you are running on uses a
>> specific version, you need to use that version - you can't break
>> compatibility.
>> Since Spark uses a lot of dependencies, I don't expect the user to check
>> whether Spark uses, for instance, Jackson, and which version.
>> I also don't expect the ordinary user to know whether Spark Structured
>> Streaming uses Spark SQL or not when they need both - especially when
>> they are already packaged together in the Spark server.
>>
>> Having said that, I guess they will just try adding packages, and if
>> something doesn't compile they will use Coursier to fix the
>> dependencies...
>>
>> Thanks anyway!
>>
>> On Tue, Jun 3, 2025, 22:09, Sean Owen <sro...@gmail.com> wrote:
>>
>>> Do you have an example of what you mean?
>>>
>>> Yes, a deployment of Spark has all the modules. You do not need to
>>> (should not, in fact) deploy Spark code with your Spark app for this
>>> reason.
>>> You still need to express dependencies on the Spark code that your app
>>> uses at *compile* time, however, in order to compile - or else how can
>>> it compile?
>>> You do not add dependencies that you do not directly use, no.
>>> This is like any other multi-module project in the Maven/SBT ecosystem.
>>>
>>> On Tue, Jun 3, 2025 at 1:59 PM Nimrod Ofek <ofek.nim...@gmail.com>
>>> wrote:
>>>
>>>> It does not compile if I don't add spark-sql.
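>>>> (In sbt terms, the extra line I end up adding just to make it compile
>>>> is roughly the one from my snippet further down the thread:
>>>>
>>>>   "org.apache.spark" %% "spark-sql" % sparkVersion % Provided
>>>>
>>>> even though spark-sql is already part of the Spark distribution anyway.)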
>>>> In usual projects I'd agree with you, but since Spark comes complete
>>>> with all its dependencies - unlike other programs, where you deploy
>>>> only certain dependencies - I see no reason for users to select up
>>>> front specific dependencies that are already bundled in the Spark
>>>> server.
>>>>
>>>> On Tue, Jun 3, 2025, 21:44, Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> I think Spark, like any project, is large enough to decompose into
>>>>> modules, and it has been. A single app almost surely doesn't need all
>>>>> the modules. So yes, you have to depend on the modules you actually
>>>>> need, and I think that's normal. See Jackson for example.
>>>>> (spark-sql is not necessary, as it's required by the modules you
>>>>> depend on already.)
>>>>>
>>>>> What's the name for this new convenience package?
>>>>> spark-avro-sql-kafka? That seems too specific. And what about the 100
>>>>> other variations that other apps need?
>>>>> For example, some apps will not need spark-sql-kafka but will need
>>>>> spark-streaming-kafka.
>>>>>
>>>>> You do not have to depend on exactly the same versions of dependencies
>>>>> that Spark does, although that's the safest thing to do. For example,
>>>>> unless you use Avro directly and its version matters to you, you do
>>>>> not declare it in your POM. If you do, that's fine; Maven/SBT decides
>>>>> on what version to use based on what you say and what Spark says. And
>>>>> this could be wrong, but that's life in the world of dependencies.
>>>>> Much of the time, it works.
>>>>>
>>>>> On Tue, Jun 3, 2025 at 1:35 PM Nimrod Ofek <ofek.nim...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I'll give an example:
>>>>>> If I have a project that reads Avro messages from a Kafka topic and
>>>>>> writes them to Delta tables, I would expect to set only:
>>>>>>
>>>>>> libraryDependencies ++= Seq(
>>>>>>   "io.delta" %% "delta-spark" % deltaVersion % Provided,
>>>>>>   "org.apache.spark" %% "spark-avro" % sparkVersion,
>>>>>>   "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
>>>>>>   "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
>>>>>>   "za.co.absa" %% "abris" % "6.4.0",
>>>>>>   "org.apache.avro" % "avro" % apacheAvro,
>>>>>>   "io.confluent" % "kafka-schema-registry-client" % "7.5.1",
>>>>>>   "com.github.pureconfig" %% "pureconfig" % "0.17.5"
>>>>>> )
>>>>>>
>>>>>> and not to have to add also
>>>>>>
>>>>>>   "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
>>>>>>
>>>>>> And to be honest - I don't think the users really need to understand
>>>>>> the internal structure to know which jar they need to add to use
>>>>>> each feature...
>>>>>> I don't think they need to know which project they need to depend
>>>>>> on, as long as it's already provided... They just need to configure
>>>>>> Spark as provided :)
>>>>>>
>>>>>> Thanks,
>>>>>> Nimrod
>>>>>>
>>>>>> On Tue, Jun 3, 2025 at 8:57 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> For sure, but that is what Maven/SBT do. They resolve your project's
>>>>>>> dependencies, looking at all their transitive dependencies,
>>>>>>> according to some rules.
>>>>>>> You do not need to re-declare Spark's dependencies in your project,
>>>>>>> no.
>>>>>>> I'm not quite sure what you mean.
>>>>>>>
>>>>>>> On Tue, Jun 3, 2025 at 12:55 PM Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Sean.
>>>>>>>> There are other dependencies that you need to align with Spark if
>>>>>>>> you need to use them as well - like Guava, Jackson, etc.
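>>>>>>>> A rough sketch of what I mean, in sbt - the version values below
>>>>>>>> are just placeholders that have to be copied by hand from the pom
>>>>>>>> of the Spark version actually deployed:
>>>>>>>>
>>>>>>>> // Placeholders - the real values must match the deployed Spark.
>>>>>>>> val sparkGuavaVersion   = "x.y"
>>>>>>>> val sparkJacksonVersion = "x.y.z"
>>>>>>>>
>>>>>>>> // Pin the transitive versions so the application jar doesn't
>>>>>>>> // drift from what is already on the cluster classpath.
>>>>>>>> dependencyOverrides ++= Seq(
>>>>>>>>   "com.google.guava" % "guava" % sparkGuavaVersion,
>>>>>>>>   "com.fasterxml.jackson.core" % "jackson-databind" % sparkJacksonVersion
>>>>>>>> )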
>>>>>>>> I find them more difficult to use, because you need to go to the
>>>>>>>> Spark repo to check the correct version used - and if those
>>>>>>>> versions are upgraded between Spark releases, you need to check
>>>>>>>> that and upgrade as well.
>>>>>>>> What do you think?
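(Coming back to the uber-jar concern at the top of the thread: one possible workaround - a rough sketch using the sbt-assembly plugin, where the shaded package prefix "myapp.shaded" and the plugin version are placeholders, and none of this is something Spark itself provides - is to relocate the conflicting packages inside the application jar:

// project/plugins.sbt - plugin version is a placeholder
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "x.y.z")

// build.sbt - rename bundled copies of Guava and Jackson so they
// cannot clash with the versions Spark already ships on the cluster.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.common.**"     -> "myapp.shaded.guava.@1").inAll,
  ShadeRule.rename("com.fasterxml.jackson.**" -> "myapp.shaded.jackson.@1").inAll
)

Shading has its own costs - reflection, serialization, and service files can break - so it is more of an escape hatch than a general answer.)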