It does not compile if I don't add spark-sql. In usual projects I'd agree with you, but since Spark ships complete with all of its dependencies - unlike other programs, where you deploy only the specific dependencies you need - I see no reason for users to have to pick out up front specific dependencies that are already bundled in the Spark server.
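For concreteness, a minimal sketch of the extra line in question in build.sbt, with spark-sql declared as Provided so it is on the compile classpath but not packaged into the application jar. The sparkVersion value here is an assumption and should match the cluster's Spark release:

    // Sketch only: spark-sql is needed to compile against the DataFrame/SQL APIs,
    // but the cluster already ships it, so Provided keeps it out of the assembly.
    val sparkVersion = "3.5.1"  // assumption: set to the Spark version actually deployed
    libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % Provided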
On Tue, Jun 3, 2025, 21:44 Sean Owen <sro...@gmail.com> wrote:

> I think Spark, like any project, is large enough to decompose into
> modules, and it has been. A single app almost surely doesn't need all the
> modules. So yes, you have to depend on the modules you actually need, and I
> think that's normal. See Jackson for example.
> (spark-sql is not necessary, as it's required by the modules you depend on
> already.)
>
> What's the name for this new convenience package? spark-avro-sql-kafka?
> That seems too specific. And what about the 100 other variations that other
> apps need? For example, some apps will not need spark-sql-kafka but will
> need spark-streaming-kafka.
>
> You do not have to depend on exactly the same versions of dependencies
> that Spark does, although that's the safest thing to do. For example,
> unless you use Avro directly and its version matters to you, you do not
> declare it in your POM. If you do, that's fine; Maven/SBT decides which
> version to use based on what you say and what Spark says. And this could be
> wrong, but that's life in the world of dependencies. Much of the time, it
> works.
>
> On Tue, Jun 3, 2025 at 1:35 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
>> I'll give an example:
>> If I have a project that reads Avro messages from a Kafka topic and
>> writes them to Delta tables, I would expect to set only:
>>
>> libraryDependencies ++= Seq(
>>   "io.delta" %% "delta-spark" % deltaVersion % Provided,
>>   "org.apache.spark" %% "spark-avro" % sparkVersion,
>>   "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
>>   "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
>>   "za.co.absa" %% "abris" % "6.4.0",
>>   "org.apache.avro" % "avro" % apacheAvro,
>>   "io.confluent" % "kafka-schema-registry-client" % "7.5.1",
>>   "com.github.pureconfig" %% "pureconfig" % "0.17.5"
>> )
>>
>> and not to also have to add
>>
>> "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
>>
>> And to be honest, I don't think users really need to understand the
>> internal structure to know which jar they need to add to use each
>> feature...
>> I don't think they need to know which module to depend on, as long as
>> it's already provided... They just need to configure spark-provided :)
>>
>> Thanks,
>> Nimrod
>>
>> On Tue, Jun 3, 2025 at 8:57 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> For sure, but that is what Maven/SBT do. They resolve your project's
>>> dependencies, looking at all their transitive dependencies, according to
>>> some rules.
>>> You do not need to re-declare Spark's dependencies in your project, no.
>>> I'm not quite sure what you mean.
>>>
>>> On Tue, Jun 3, 2025 at 12:55 PM Nimrod Ofek <ofek.nim...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Sean.
>>>> There are other dependencies that you need to align with Spark if you
>>>> need to use them as well - like Guava, Jackson, etc.
>>>> I find them more difficult to use, because you need to go to the Spark
>>>> repo to check the correct version used, and if there are upgrades
>>>> between Spark versions you need to check whether you have to upgrade as
>>>> well.
>>>> What do you think?
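On the last point above (aligning libraries like Guava or Jackson with what Spark ships), a rough, non-authoritative sbt sketch of one way to pin those transitive versions explicitly using dependencyOverrides. The version strings below are placeholders, not taken from any specific Spark release, and would need to be checked against the POM of the Spark version actually deployed:

    // build.sbt sketch: force Jackson/Guava to the versions the deployed Spark
    // build uses, instead of whatever newer versions other libraries pull in.
    // Both version numbers are placeholders and must be verified against Spark's POM.
    val sparkJacksonVersion = "2.15.2"
    val sparkGuavaVersion = "33.1.0-jre"

    dependencyOverrides ++= Seq(
      "com.fasterxml.jackson.core" % "jackson-databind" % sparkJacksonVersion,
      "com.fasterxml.jackson.module" %% "jackson-module-scala" % sparkJacksonVersion,
      "com.google.guava" % "guava" % sparkGuavaVersion
    )

Running sbt's dependencyTree task afterwards (available out of the box in recent sbt versions, or via the sbt-dependency-graph plugin), or mvn dependency:tree for Maven builds, shows which versions were actually resolved, which is a quicker check than reading the Spark source repo each time.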