It does not compile if I don't add spark-sql. In usual projects I'd agree with you, but since Spark ships complete with all of its dependencies - unlike other programs, where you deploy only the specific dependencies you need - I see no reason for users to have to pick out up front specific dependencies that are already bundled in the Spark server.
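For concreteness, a minimal sketch of the extra line in question in build.sbt, with spark-sql declared as Provided so it is on the compile classpath but not packaged into the application jar. The sparkVersion value here is an assumption and should match the cluster's Spark release:

    // Sketch only: spark-sql is needed to compile against the DataFrame/SQL APIs,
    // but the cluster already ships it, so Provided keeps it out of the assembly.
    val sparkVersion = "3.5.1"  // assumption: set to the Spark version actually deployed
    libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % Provided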
On Tue, Jun 3, 2025, 21:44 Sean Owen <sro...@gmail.com> wrote:

> I think Spark, like any project, is large enough to decompose into
> modules, and it has been. A single app almost surely doesn't need all the
> modules. So yes, you have to depend on the modules you actually need, and I
> think that's normal. See Jackson for example.
> (spark-sql is not necessary, as it's required by the modules you depend on
> already.)
>
> What's the name for this new convenience package? spark-avro-sql-kafka?
> That seems too specific. And what about the 100 other variations that other
> apps need? For example, some apps will not need spark-sql-kafka but will
> need spark-streaming-kafka.
>
> You do not have to depend on exactly the same versions of dependencies
> that Spark does, although that's the safest thing to do. For example,
> unless you use Avro directly and its version matters to you, you do not
> declare it in your POM. If you do, that's fine; Maven/SBT decides which
> version to use based on what you say and what Spark says. And this could be
> wrong, but that's life in the world of dependencies. Much of the time, it
> works.
>
> On Tue, Jun 3, 2025 at 1:35 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
>> I'll give an example:
>> If I have a project that reads Avro messages from a Kafka topic and
>> writes them to Delta tables, I would expect to set only:
>>
>> libraryDependencies ++= Seq(
>>   "io.delta" %% "delta-spark" % deltaVersion % Provided,
>>   "org.apache.spark" %% "spark-avro" % sparkVersion,
>>   "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
>>   "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
>>   "za.co.absa" %% "abris" % "6.4.0",
>>   "org.apache.avro" % "avro" % apacheAvro,
>>   "io.confluent" % "kafka-schema-registry-client" % "7.5.1",
>>   "com.github.pureconfig" %% "pureconfig" % "0.17.5"
>> )
>>
>> and not to also have to add
>>
>> "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
>>
>> And to be honest, I don't think users really need to understand the
>> internal structure to know which jar they need to add to use each
>> feature...
>> I don't think they need to know which module to depend on, as long as
>> it's already provided... They just need to configure spark-provided :)
>>
>> Thanks,
>> Nimrod
>>
>> On Tue, Jun 3, 2025 at 8:57 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> For sure, but that is what Maven/SBT do. They resolve your project's
>>> dependencies, looking at all their transitive dependencies, according to
>>> some rules.
>>> You do not need to re-declare Spark's dependencies in your project, no.
>>> I'm not quite sure what you mean.
>>>
>>> On Tue, Jun 3, 2025 at 12:55 PM Nimrod Ofek <ofek.nim...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Sean.
>>>> There are other dependencies that you need to align with Spark if you
>>>> need to use them as well - like Guava, Jackson, etc.
>>>> I find them more difficult to use, because you need to go to the Spark
>>>> repo to check the correct version used, and if there are upgrades
>>>> between Spark versions you need to check whether you have to upgrade as
>>>> well.
>>>> What do you think?
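On the last point above (aligning libraries like Guava or Jackson with what Spark ships), a rough, non-authoritative sbt sketch of one way to pin those transitive versions explicitly using dependencyOverrides. The version strings below are placeholders, not taken from any specific Spark release, and would need to be checked against the POM of the Spark version actually deployed:

    // build.sbt sketch: force Jackson/Guava to the versions the deployed Spark
    // build uses, instead of whatever newer versions other libraries pull in.
    // Both version numbers are placeholders and must be verified against Spark's POM.
    val sparkJacksonVersion = "2.15.2"
    val sparkGuavaVersion = "33.1.0-jre"

    dependencyOverrides ++= Seq(
      "com.fasterxml.jackson.core" % "jackson-databind" % sparkJacksonVersion,
      "com.fasterxml.jackson.module" %% "jackson-module-scala" % sparkJacksonVersion,
      "com.google.guava" % "guava" % sparkGuavaVersion
    )

Running sbt's dependencyTree task afterwards (available out of the box in recent sbt versions, or via the sbt-dependency-graph plugin), or mvn dependency:tree for Maven builds, shows which versions were actually resolved, which is a quicker check than reading the Spark source repo each time.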