Yes, you're just saying that if your app depends on Foo, and Spark depends on Foo, then ideally you depend on the exact same version Spark uses. Otherwise it's up to Maven/SBT to pick one version or the other, which may or may not be suitable. Yes, dependency conflicts are painful to deal with and a real thing everywhere; this gets into discussions like "why isn't everything shaded?", but that's not the point here, I think.
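(To make the "same version Spark uses" part concrete: in sbt this usually ends up as a version override, roughly like the sketch below. The Jackson coordinates and version are just placeholders here - check the actual version in the Spark release you deploy against.)

// Roughly: force resolution of a shared transitive dependency to the version
// the Spark runtime ships. The version string below is a placeholder only.
dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.15.2"
)

Maven has the same knob via <dependencyManagement>. Either way it only settles which version wins; it doesn't guarantee the two uses are actually binary-compatible.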
But if your app depends on Foo, then Foo is in your POM regardless of what Spark does. It can be painful to figure out whether that conflicts with Spark's dependencies, sure, but you can work that out with dependency:tree or similar. I also don't think adding a POM-only module changes any of that; you still have the same problem even if there is a spark-uber package depending on every module.

Knowing which submodule is of interest - that does take some work. It's hopefully in the docs, and most apps just need spark-sql, but I can see this as an issue. I could see an argument for declaring a single POM-only artifact that depends on all Spark modules. Then you depend on that as 'provided' and you have all of Spark available at compile time only. (This is almost what spark-parent does, but I don't think it works that way.) It feels inaccurate, and not helpful for most use cases, but I don't see a major problem with it, actually. Your dependency graph gets a lot bigger with stuff you don't need, but it's all in provided scope anyway. (There's a rough sketch of the provided-scope setup at the bottom of this mail.)

On Tue, Jun 3, 2025 at 5:23 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:

> You don't add dependencies you don't use, but you do need to declare
> dependencies you do use, and if the platform you are running on uses a
> specific version, you need to use that version - you can't break
> compatibility.
> Since Spark uses a lot of dependencies, I don't expect the user to check
> whether Spark uses, for instance, Jackson, and what version.
> I also didn't expect the ordinary user to know whether Spark Structured
> Streaming uses spark-sql or not when they need both - especially when
> they are already packaged together in the Spark server.
>
> Having said that, I guess they will just try adding packages, and if
> something won't compile they will use coursier to fix the dependencies...
>
> Thanks anyway!
>
> On Tue, Jun 3, 2025, 22:09 Sean Owen <sro...@gmail.com> wrote:
>
>> Do you have an example of what you mean?
>>
>> Yes, a deployment of Spark has all the modules. You do not need to
>> (should not, in fact) deploy Spark code with your Spark app for this
>> reason. You still need to express dependencies on the Spark code that
>> your app uses at *compile* time, however, in order to compile - or else
>> how can it compile?
>> You do not add dependencies that you do not directly use, no.
>> This is like any other multi-module project in the Maven/SBT ecosystem.
>>
>> On Tue, Jun 3, 2025 at 1:59 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>
>>> It does not compile if I don't add spark-sql.
>>> In usual projects I'd agree with you, but since Spark comes complete
>>> with all its dependencies - unlike other programs, where you deploy
>>> only certain dependencies - I see no reason for users to select up
>>> front specific dependencies that are already bundled in the Spark
>>> server.
>>>
>>> On Tue, Jun 3, 2025, 21:44 Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> I think Spark, like any project, is large enough to decompose into
>>>> modules, and it has been. A single app almost surely doesn't need all
>>>> the modules. So yes, you have to depend on the modules you actually
>>>> need, and I think that's normal. See Jackson for example.
>>>> (spark-sql is not necessary, as it's required by the modules you
>>>> already depend on.)
>>>>
>>>> What's the name for this new convenience package? spark-avro-sql-kafka?
>>>> That seems too specific. And what about the 100 other variations that
>>>> other apps need?
>>>> For example, some apps will not need spark-sql-kafka but will need
>>>> spark-streaming-kafka.
>>>>
>>>> You do not have to depend on exactly the same versions of dependencies
>>>> that Spark does, although that's the safest thing to do. For example,
>>>> unless you use Avro directly and its version matters to you, you do
>>>> not declare this in your POM. If you do, that's fine; Maven/SBT
>>>> decides what version to use based on what you say and what Spark says.
>>>> And this could be wrong, but that's life in the world of dependencies.
>>>> Much of the time, it works.
>>>>
>>>> On Tue, Jun 3, 2025 at 1:35 PM Nimrod Ofek <ofek.nim...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'll give an example:
>>>>> If I have a project that reads Avro messages from a Kafka topic and
>>>>> writes them to Delta tables, I would expect to set only:
>>>>>
>>>>> libraryDependencies ++= Seq(
>>>>>   "io.delta" %% "delta-spark" % deltaVersion % Provided,
>>>>>   "org.apache.spark" %% "spark-avro" % sparkVersion,
>>>>>   "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
>>>>>   "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
>>>>>   "za.co.absa" %% "abris" % "6.4.0",
>>>>>   "org.apache.avro" % "avro" % apacheAvro,
>>>>>   "io.confluent" % "kafka-schema-registry-client" % "7.5.1",
>>>>>   "com.github.pureconfig" %% "pureconfig" % "0.17.5"
>>>>> )
>>>>>
>>>>> and not to also have to add
>>>>>
>>>>>   "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
>>>>>
>>>>> And to be honest, I don't think users really need to understand the
>>>>> internal structure to know which jar they need to add to use each
>>>>> feature...
>>>>> I don't think they need to know which project they need to depend on,
>>>>> as long as it's already provided... They just need to configure
>>>>> spark-provided :)
>>>>>
>>>>> Thanks,
>>>>> Nimrod
>>>>>
>>>>> On Tue, Jun 3, 2025 at 8:57 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> For sure, but that is what Maven/SBT do: they resolve your project's
>>>>>> dependencies, looking at all their transitive dependencies,
>>>>>> according to some rules.
>>>>>> You do not need to re-declare Spark's dependencies in your project,
>>>>>> no.
>>>>>> I'm not quite sure what you mean.
>>>>>>
>>>>>> On Tue, Jun 3, 2025 at 12:55 PM Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Sean.
>>>>>>> There are other dependencies that you need to align with Spark if
>>>>>>> you use them as well - like Guava, Jackson, etc.
>>>>>>> I find those more difficult to use, because you need to go to the
>>>>>>> Spark repo to check the correct version used, and if there are
>>>>>>> upgrades between versions you need to check that and upgrade as
>>>>>>> well.
>>>>>>> What do you think?
>>>>>>
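To make the provided-scope point above concrete, here is a minimal sketch in sbt of what Nimrod's example ends up looking like once spark-sql is declared explicitly. sparkVersion is a placeholder; which modules can safely be marked Provided depends entirely on what your particular deployment actually bundles (the connector modules below are often not part of a stock distribution), and the single all-in-one artifact discussed earlier is hypothetical - it doesn't exist today.

// spark-sql is what the application code compiles against; Provided scope
// keeps it out of the assembled jar because every Spark runtime ships it.
// Whether the connector modules (spark-avro, the Kafka connectors) can also
// be Provided depends on whether your deployment bundles them; a stock
// distribution generally does not, so they stay in the default scope here.
val sparkVersion = "3.5.1"   // placeholder: match the cluster you deploy to

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"                  % sparkVersion % Provided,
  "org.apache.spark" %% "spark-avro"                 % sparkVersion,
  "org.apache.spark" %% "spark-sql-kafka-0-10"       % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
)

With something like this in place, mvn dependency:tree (or the sbt equivalent) is still the way to see which versions actually win where your own libraries overlap with Spark's.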