Yes, that was my point. Whether I'm directly using something or not, it is really there, so it would be beneficial for me to have a way of knowing exactly which dependencies I have, even the ones I don't use directly (in case a or b), because they are there.
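(A rough sketch of the kind of check I mean, assuming an sbt 1.4+ build, where a dependencyTree task is built in, or a Maven build for the second command; the Jackson group id is only an example filter:

sbt dependencyTree
mvn dependency:tree -Dincludes=com.fasterxml.jackson.core

That shows what my own build resolves to, but it still doesn't tell me which versions the deployed Spark runtime itself ships, which is the part I'd like to be able to see easily.)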
For instance, if I am creating a library for Delta that helps track the lag of Structured Streaming Delta-to-Delta table streams, I may not need anything from Spark directly, but if I declare a dependency on Jackson or Guava with a version different from the one Spark already uses and packages, I might break things... because I'll add Jackson or Guava to my uber jar, and that will clash with the jars deployed out of the box...

On Wed, Jun 4, 2025, 01:38, Sean Owen <sro...@gmail.com> wrote:

> Yes, you're just saying that if your app depends on Foo, and Spark depends
> on Foo, then ideally you depend on the exact same version Spark uses.
> Otherwise it's up to Maven/SBT to pick one or the other version, which
> might or might not be suitable. Yes, dependency conflicts are painful to
> deal with and a real thing everywhere, and this gets into discussions like,
> why isn't everything shaded? But that's not the point here, I think.
>
> But if your app depends on Foo, then Foo is in your POM regardless of what
> Spark does. It gets painful to figure out whether that conflicts with
> Spark's dependencies, sure, but you can figure it out with dependency:tree
> or similar. I also don't think adding a POM-only module changes any of
> that; you still have the same problem even if there is a spark-uber package
> depending on every module.
>
> Knowing which submodule is of interest - that does take some work. It's
> hopefully in the docs, and most apps just need spark-sql, but I can see
> this as an issue.
>
> I could see an argument for declaring a single POM-only artifact that
> depends on all Spark modules. Then you depend on that as 'provided' and you
> have all of Spark in compile scope only. (This is almost what spark-parent
> does, but I don't think it works that way.) It feels inaccurate, and not
> helpful for most use cases, but I don't see a major problem with it
> actually. Your dependency graph gets a lot bigger with stuff you don't
> need, but it's all in provided scope anyway.
>
> On Tue, Jun 3, 2025 at 5:23 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
>> You don't add dependencies you don't use - but you do need to declare
>> dependencies you do use, and if the platform you are running on uses a
>> specific version, you need to use that version - you can't break
>> compatibility.
>> Since Spark uses a lot of dependencies, I don't expect the user to check
>> whether Spark uses, for instance, Jackson, and which version.
>> I also don't expect the ordinary user to know whether Spark Structured
>> Streaming uses Spark SQL or not when they need both - especially when
>> they are already packaged together in the Spark server.
>>
>> Having said that, I guess they will just try adding packages, and if
>> something doesn't compile they will use Coursier to fix the
>> dependencies...
>>
>> Thanks anyway!
>>
>> On Tue, Jun 3, 2025, 22:09, Sean Owen <sro...@gmail.com> wrote:
>>
>>> Do you have an example of what you mean?
>>>
>>> Yes, a deployment of Spark has all the modules. You do not need to
>>> (should not, in fact) deploy Spark code with your Spark app for this
>>> reason.
>>> You still need to express dependencies on the Spark code that your app
>>> uses at *compile* time, however, in order to compile - or else how can
>>> it compile?
>>> You do not add dependencies that you do not directly use, no.
>>> This is like any other multi-module project in the Maven/SBT ecosystem.
>>>
>>> On Tue, Jun 3, 2025 at 1:59 PM Nimrod Ofek <ofek.nim...@gmail.com>
>>> wrote:
>>>
>>>> It does not compile if I don't add spark-sql.
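>>>> (In sbt terms, the extra line I end up adding just to make it compile
>>>> is roughly the one from my snippet further down the thread:
>>>>
>>>>   "org.apache.spark" %% "spark-sql" % sparkVersion % Provided
>>>>
>>>> even though spark-sql is already part of the Spark distribution anyway.)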
>>>> In usual projects I'd agree with you, but since Spark comes complete
>>>> with all its dependencies - unlike other programs, where you deploy
>>>> only certain dependencies - I see no reason for users to select up
>>>> front specific dependencies that are already bundled in the Spark
>>>> server.
>>>>
>>>> On Tue, Jun 3, 2025, 21:44, Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> I think Spark, like any project, is large enough to decompose into
>>>>> modules, and it has been. A single app almost surely doesn't need all
>>>>> the modules. So yes, you have to depend on the modules you actually
>>>>> need, and I think that's normal. See Jackson for example.
>>>>> (spark-sql is not necessary, as it's required by the modules you
>>>>> depend on already.)
>>>>>
>>>>> What's the name for this new convenience package?
>>>>> spark-avro-sql-kafka? That seems too specific. And what about the 100
>>>>> other variations that other apps need?
>>>>> For example, some apps will not need spark-sql-kafka but will need
>>>>> spark-streaming-kafka.
>>>>>
>>>>> You do not have to depend on exactly the same versions of dependencies
>>>>> that Spark does, although that's the safest thing to do. For example,
>>>>> unless you use Avro directly and its version matters to you, you do
>>>>> not declare it in your POM. If you do, that's fine; Maven/SBT decides
>>>>> on what version to use based on what you say and what Spark says. And
>>>>> this could be wrong, but that's life in the world of dependencies.
>>>>> Much of the time, it works.
>>>>>
>>>>> On Tue, Jun 3, 2025 at 1:35 PM Nimrod Ofek <ofek.nim...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I'll give an example:
>>>>>> If I have a project that reads Avro messages from a Kafka topic and
>>>>>> writes them to Delta tables, I would expect to set only:
>>>>>>
>>>>>> libraryDependencies ++= Seq(
>>>>>>   "io.delta" %% "delta-spark" % deltaVersion % Provided,
>>>>>>   "org.apache.spark" %% "spark-avro" % sparkVersion,
>>>>>>   "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
>>>>>>   "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
>>>>>>   "za.co.absa" %% "abris" % "6.4.0",
>>>>>>   "org.apache.avro" % "avro" % apacheAvro,
>>>>>>   "io.confluent" % "kafka-schema-registry-client" % "7.5.1",
>>>>>>   "com.github.pureconfig" %% "pureconfig" % "0.17.5"
>>>>>> )
>>>>>>
>>>>>> and not to have to add also
>>>>>>
>>>>>>   "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
>>>>>>
>>>>>> And to be honest - I don't think the users really need to understand
>>>>>> the internal structure to know which jar they need to add to use
>>>>>> each feature...
>>>>>> I don't think they need to know which project they need to depend
>>>>>> on, as long as it's already provided... They just need to configure
>>>>>> Spark as provided :)
>>>>>>
>>>>>> Thanks,
>>>>>> Nimrod
>>>>>>
>>>>>> On Tue, Jun 3, 2025 at 8:57 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> For sure, but that is what Maven/SBT do. They resolve your project's
>>>>>>> dependencies, looking at all their transitive dependencies,
>>>>>>> according to some rules.
>>>>>>> You do not need to re-declare Spark's dependencies in your project,
>>>>>>> no.
>>>>>>> I'm not quite sure what you mean.
>>>>>>>
>>>>>>> On Tue, Jun 3, 2025 at 12:55 PM Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Sean.
>>>>>>>> There are other dependencies that you need to align with Spark if
>>>>>>>> you need to use them as well - like Guava, Jackson, etc.
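>>>>>>>> A rough sketch of what I mean, in sbt - the version values below
>>>>>>>> are just placeholders that have to be copied by hand from the pom
>>>>>>>> of the Spark version actually deployed:
>>>>>>>>
>>>>>>>> // Placeholders - the real values must match the deployed Spark.
>>>>>>>> val sparkGuavaVersion   = "x.y"
>>>>>>>> val sparkJacksonVersion = "x.y.z"
>>>>>>>>
>>>>>>>> // Pin the transitive versions so the application jar doesn't
>>>>>>>> // drift from what is already on the cluster classpath.
>>>>>>>> dependencyOverrides ++= Seq(
>>>>>>>>   "com.google.guava" % "guava" % sparkGuavaVersion,
>>>>>>>>   "com.fasterxml.jackson.core" % "jackson-databind" % sparkJacksonVersion
>>>>>>>> )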
>>>>>>>> I find them more difficult to use, because you need to go to the
>>>>>>>> Spark repo to check the correct version used - and if those
>>>>>>>> versions are upgraded between Spark releases, you need to check
>>>>>>>> that and upgrade as well.
>>>>>>>> What do you think?
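(Coming back to the uber-jar concern at the top of the thread: one possible workaround - a rough sketch using the sbt-assembly plugin, where the shaded package prefix "myapp.shaded" and the plugin version are placeholders, and none of this is something Spark itself provides - is to relocate the conflicting packages inside the application jar:

// project/plugins.sbt - plugin version is a placeholder
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "x.y.z")

// build.sbt - rename bundled copies of Guava and Jackson so they
// cannot clash with the versions Spark already ships on the cluster.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.common.**"     -> "myapp.shaded.guava.@1").inAll,
  ShadeRule.rename("com.fasterxml.jackson.**" -> "myapp.shaded.jackson.@1").inAll
)

Shading has its own costs - reflection, serialization, and service files can break - so it is more of an escape hatch than a general answer.)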