Re: Spark Docker image with added packages

2024-10-17 Thread Nimrod Ofek
Hi,

Thanks all for the replies.

I am adding the Spark dev list as well - as I think this might be an issue
that needs to be addressed.

The options presented here will get the jars, but they don't help with
dependency conflicts...
For example, com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0 uses
Guava 30 while Spark 3.5.3 uses Guava 14 - the options here will result in
both conflicting versions ending up on the classpath.

How can one add packages to their Spark (during the build process of the
Docker image) - without causing unresolved conflicts?

Thanks!
Nimrod


On Tue, Oct 15, 2024 at 6:53 PM Damien Hawes  wrote:

> Herewith a more fleshed out example:
>
> An example of a *build.gradle.kts* file:
>
> plugins {
> id("java")
> }
>
> val sparkJarsDir = objects.directoryProperty().convention(layout.buildDirectory.dir("sparkJars"))
>
> repositories {
> mavenCentral()
> }
>
> val sparkJars: Configuration by configurations.creating {
> isCanBeResolved = true
> isCanBeConsumed = false
> }
>
> dependencies {
> sparkJars("com.fasterxml.jackson.core:jackson-databind:2.18.0")
> }
>
> val copySparkJars by tasks.registering(Copy::class) {
> group = "build"
> description = "Copies the appropriate jars to the configured spark jars directory"
> from(sparkJars)
> into(sparkJarsDir)
> }
>
> Now, the *Dockerfile*:
>
> FROM spark:3.5.3-scala2.12-java17-ubuntu
>
> USER root
>
> COPY --chown=spark:spark build/sparkJars/* "$SPARK_HOME/jars/"
>
> USER spark
>
>
> Kind regards,
>
> Damien
>
> On Tue, Oct 15, 2024 at 4:19 PM Damien Hawes 
> wrote:
>
>> The simplest solution I have found for this is to use Gradle
>> (or Maven, if you prefer) and list the dependencies that I want copied to
>> $SPARK_HOME/jars as project dependencies.
>>
>> Summary of steps to follow:
>>
>> 1. Using your favourite build tool, declare a dependency on your required
>> packages.
>> 2. Write your Dockerfile, with or without the Spark binaries inside it.
>> 3. Use your build tool to copy the dependencies to a location that the
>> Docker daemon can access.
>> 4. Copy the dependencies into the correct directory.
>> 5. Ensure those files have the correct permissions.
>>
>> In my opinion, it is pretty easy to do this with Gradle.
>>
>> On Tue, Oct 15, 2024 at 3:28 PM Nimrod Ofek  wrote:
>>
>>> Hi all,
>>>
>>> I am creating a base Spark image that we are using internally.
>>> We need to add some packages to the base image:
>>> spark:3.5.1-scala2.12-java17-python3-r-ubuntu
>>>
>>> Of course I do not want to start Spark with --packages "..." - as it is
>>> not efficient at all - I would like to add the needed jars to the image.
>>>
>>> Ideally, I would like to add to my image something that adds the needed
>>> packages - something like:
>>>
>>> RUN $SPARK_HOME/bin/add-packages "..."
>>>
>>> But AFAIK there is no such option.
>>>
>>> Other than running Spark to add those packages and then creating the
>>> image - or running Spark always with --packages "..."  - what can I do?
>>> Is there a way to run just the code that the --packages option runs -
>>> without running Spark - so I can add the needed dependencies to my image?
>>>
>>> I am sure I am not the only one, nor the first, to encounter this...
>>>
>>> Thanks!
>>> Nimrod
>>>
>>>
>>>
>>


Re: [DISCUSS] Migrate or deprecate the Spark Kinesis connector

2024-10-17 Thread Johnson Chen
Hi Jungtaek,

Thanks for the information. Given that, are we good to close this Jira
https://issues.apache.org/jira/browse/SPARK-45720 ?

-Junyu

On Wed, Oct 16, 2024 at 8:36 PM Jungtaek Lim 
wrote:

> DStream was deprecated in Spark 3.4.0, hence the Kinesis connector for
> DStream inherits the same fate. We just didn't make every DStream class
> produce warning messages; we made the entry class produce a warning and
> thought that was sufficient.
>
> On Mon, Oct 14, 2024 at 5:03 PM Johnson Chen 
> wrote:
>
>> Hi Spark community,
>>
>> A couple months ago, I raised a PR to upgrade the AWS SDK to v2 for the
>> Spark Kinesis connector: https://github.com/apache/spark/pull/44211.
>> Given that the 4.0 feature freeze is coming, I am following up to check
>> whether we still want this change in the upcoming 4.0 release. If yes, I
>> can revise and rebase the PR accordingly.
>>
>> Here is the tracking Jira:
>> https://issues.apache.org/jira/browse/SPARK-45720
>>
>> --
>> Thanks,
>> Junyu
>>
>

-- 
Thanks,
Junyu Chen


Re: Spark Docker image with added packages

2024-10-17 Thread Damien Hawes
Hi,

As the maintainer of the derived image, it's on you to ensure that your
added dependencies do not conflict with Spark's. Speaking from experience,
there are several ways to achieve this:

1. Where possible, use packages that already ship their dependencies shaded
and relocated.
2. If you're creating packages of your own, ensure your dependencies (and
their transitive dependencies) are compatible with the ones that Spark
uses. Otherwise, create your own shaded packages where your packages
require different versions (see the sketch below).
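
To make option 2 concrete, here is a minimal Gradle (Kotlin DSL) sketch of
my own - not from this thread - that relocates a conflicting library with
the Shadow plugin. The plugin version, the Guava version and the relocation
prefix are assumptions you would adapt:

plugins {
    id("java")
    id("com.github.johnrengelman.shadow") version "8.1.1"
}

repositories {
    mavenCentral()
}

dependencies {
    // The library whose version differs from the one Spark ships.
    implementation("com.google.guava:guava:30.1.1-jre")
}

tasks.shadowJar {
    // Move Guava under a private prefix so it cannot clash with the Guava
    // jar that already sits in $SPARK_HOME/jars.
    relocate("com.google.common", "myorg.shaded.com.google.common")
}

The shadow jar produced by this build is what you would copy into the
image, instead of the plain jar plus its own Guava.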

Both build systems can aid you in this, and both can give you a dependency
report. Maven has the enforcer plugin with the dependency convergence rule
to prevent differing versions. This does require you to declare the Spark
dependencies in the pom (with *provided* scope), though the dependency
convergence rule doesn't inspect the provided scope by default. Once you
include the *provided* scope in the check, you'll start encountering a lot
of convergence failures among the transitive dependencies that Spark uses,
and you'll have to work through each of these to get the build to pass.


<properties>
    <spark.version>3.4.3</spark.version>
    <scala.compat.version>2.12</scala.compat.version>
    <scala.minor.version>11</scala.minor.version>
    <scala.version>${scala.compat.version}.${scala.minor.version}</scala.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.36</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.compat.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-enforcer-plugin</artifactId>
            <version>3.5.0</version>
            <executions>
                <execution>
                    <id>enforce</id>
                    <phase>test</phase>
                    <goals>
                        <goal>enforce</goal>
                    </goals>
                    <configuration>
                        <rules>
                            <dependencyConvergence/>
                        </rules>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <version>3.8.0</version>
            <executions>
                <execution>
                    <id>copy-dependencies</id>
                    <phase>package</phase>
                    <goals>
                        <goal>copy-dependencies</goal>
                    </goals>
                    <configuration>
                        <outputDirectory>${project.build.directory}/spark-jars</outputDirectory>
                        <overWriteReleases>false</overWriteReleases>
                        <overWriteSnapshots>false</overWriteSnapshots>
                        <overWriteIfNewer>true</overWriteIfNewer>
                        <includeScope>runtime</includeScope>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
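
On the Gradle side, a rough counterpart - my own sketch, not part of the
original build - is to make version conflicts fail the build and then read
the dependency report (gradle dependencies, or mvn dependency:tree on the
Maven side) to find where the clash comes from:

// build.gradle.kts - sketch only; similar in spirit to Maven's
// dependencyConvergence rule: two versions of the same module fail the
// resolution instead of the newer one being picked silently.
configurations.all {
    resolutionStrategy {
        failOnVersionConflict()
    }
}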

On Thu, Oct 17, 2024 at 1:51 PM Nimrod Ofek  wrote:

>
> Hi,
>
> Thanks all for the replies.
>
> I am adding the Spark dev list as well - as I think this might be an issue
> that needs to be addressed.
>
> The options presented here will get the jars, but they don't help with
> dependency conflicts...
> For example, com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0 uses
> Guava 30 while Spark 3.5.3 uses Guava 14 - the options here will result in
> both conflicting versions ending up on the classpath.
>
> How can one add packages to their Spark (during the build process of the
> Docker image) - without causing unresolved conflicts?
>
> Thanks!
> Nimrod
>
>
> On Tue, Oct 15, 2024 at 6:53 PM Damien Hawes 
> wrote:
>
>> Herewith a more fleshed out example:
>>
>> An example of a *build.gradle.kts* file:
>>
>> plugins {
>> id("java")
>> }
>>
>> val sparkJarsDir = objects.directoryProperty().convention(layout.buildDirectory.dir("sparkJars"))
>>
>> repositories {
>> mavenCentral()
>> }
>>
>> val sparkJars: Configuration by configurations.creating {
>> isCanBeResolved = true
>> isCanBeConsumed = false
>> }
>>
>> dependencies {
>> sparkJars("com.fasterxml.jackson.core:jackson-databind:2.18.0")
>> }
>>
>> val copySparkJars by tasks.registering(Copy::class) {
>> group = "build"
>> description = "Copies the appropriate jars to the configured spark jars directory"
>> from(sparkJars)
>> into(sparkJarsDir)
>> }
>>
>> Now, the *Dockerfile*:
>>
>> FROM spark:3.5.3-scala2.12-java17-ubuntu
>>
>> USER root
>>
>> COPY --chown=spark:spark build/sparkJars/* "$SPARK_HOME/jars/"
>>
>> USER spark
>>
>>
>> Kind regards,
>>
>> Damien
>>
>> On Tue, Oct 15, 2024 at 4:19 PM Damien Hawes 
>> wrote:
>>
>>> The simplest solution I have found for this is to use
>>> Gradle (or Maven, if you prefer) and list the dependencies that I want
>>> copied to $SPARK_HOME/jars as project dependencies.
>>>
>>> Summary of steps to follow:
>>>
>>> 1. Using your favourite build tool, declare a dependency on your
>>> required packages.
>>> 2. Write your Dockerfile, with or without the Spark binaries inside it.
>>> 3. Use your build tool to copy the dependencies to a location that the
>>> Docker daemon can access.
>>> 4. Copy the dependencies into the correct directory.
>>> 5. Ensure those files have the correct permissions.
>>>
>>> In my opinion, it is pretty easy to do this with Gradle.
>>>
>>> On Tue, Oct 15, 2024 at 3:28 PM Nimrod Ofek  wrote:
>>>
 Hi all,

 I am creating a ba

Re: Spark Docker image with added packages

2024-10-17 Thread Ángel
Creating a custom classloader to load classes from those jars?
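
For what it's worth, a minimal sketch of that idea in Kotlin - my own
illustration, not something from this thread - is a child-first classloader
that prefers the extra jars over whatever the parent (Spark's) classloader
would resolve. Spark's spark.driver.userClassPathFirst and
spark.executor.userClassPathFirst settings give a similar child-first
behaviour for user jars.

import java.net.URL
import java.net.URLClassLoader

// Sketch only: look in the added jars first, then fall back to the parent
// classloader, so the version bundled with the job wins over Spark's copy.
class ChildFirstClassLoader(urls: Array<URL>, parent: ClassLoader) :
    URLClassLoader(urls, parent) {

    override fun loadClass(name: String, resolve: Boolean): Class<*> =
        synchronized(getClassLoadingLock(name)) {
            val clazz = findLoadedClass(name)
                ?: try {
                    findClass(name)              // the added jars first
                } catch (e: ClassNotFoundException) {
                    super.loadClass(name, false) // parent as fallback
                }
            if (resolve) resolveClass(clazz)
            clazz
        }
}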

On Thu, Oct 17, 2024 at 7:47 PM Nimrod Ofek  wrote:

>
> Hi,
>
> Thanks all for the replies.
>
> I am adding the Spark dev list as well - as I think this might be an issue
> that needs to be addressed.
>
> The options presented here will get the jars, but they don't help with
> dependency conflicts...
> For example, com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0 uses
> Guava 30 while Spark 3.5.3 uses Guava 14 - the options here will result in
> both conflicting versions ending up on the classpath.
>
> How can one add packages to their Spark (during the build process of the
> Docker image) - without causing unresolved conflicts?
>
> Thanks!
> Nimrod
>
>
> On Tue, Oct 15, 2024 at 6:53 PM Damien Hawes 
> wrote:
>
>> Herewith a more fleshed out example:
>>
>> An example of a *build.gradle.kts* file:
>>
>> plugins {
>> id("java")
>> }
>>
>> val sparkJarsDir = objects.directoryProperty().convention(layout.buildDirectory.dir("sparkJars"))
>>
>> repositories {
>> mavenCentral()
>> }
>>
>> val sparkJars: Configuration by configurations.creating {
>> isCanBeResolved = true
>> isCanBeConsumed = false
>> }
>>
>> dependencies {
>> sparkJars("com.fasterxml.jackson.core:jackson-databind:2.18.0")
>> }
>>
>> val copySparkJars by tasks.registering(Copy::class) {
>> group = "build"
>> description = "Copies the appropriate jars to the configured spark jars directory"
>> from(sparkJars)
>> into(sparkJarsDir)
>> }
>>
>> Now, the *Dockerfile*:
>>
>> FROM spark:3.5.3-scala2.12-java17-ubuntu
>>
>> USER root
>>
>> COPY --chown=spark:spark build/sparkJars/* "$SPARK_HOME/jars/"
>>
>> USER spark
>>
>>
>> Kind regards,
>>
>> Damien
>>
>> On Tue, Oct 15, 2024 at 4:19 PM Damien Hawes 
>> wrote:
>>
>>> The simplest solution I have found for this is to use
>>> Gradle (or Maven, if you prefer) and list the dependencies that I want
>>> copied to $SPARK_HOME/jars as project dependencies.
>>>
>>> Summary of steps to follow:
>>>
>>> 1. Using your favourite build tool, declare a dependency on your
>>> required packages.
>>> 2. Write your Dockerfile, with or without the Spark binaries inside it.
>>> 3. Use your build tool to copy the dependencies to a location that the
>>> Docker daemon can access.
>>> 4. Copy the dependencies into the correct directory.
>>> 5. Ensure those files have the correct permissions.
>>>
>>> In my opinion, it is pretty easy to do this with Gradle.
>>>
>>> On Tue, Oct 15, 2024 at 3:28 PM Nimrod Ofek  wrote:
>>>
 Hi all,

 I am creating a base Spark image that we are using internally.
 We need to add some packages to the base image:
 spark:3.5.1-scala2.12-java17-python3-r-ubuntu

 Of course I do not want to start Spark with --packages "..." - as it is
 not efficient at all - I would like to add the needed jars to the image.

 Ideally, I would like to add to my image something that adds the
 needed packages - something like:

 RUN $SPARK_HOME/bin/add-packages "..."

 But AFAIK there is no such option.

 Other than running Spark to add those packages and then creating the
 image - or running Spark always with --packages "..."  - what can I do?
 Is there a way to run just the code that the --packages option runs -
 without running Spark - so I can add the needed dependencies to my
 image?

 I am sure I am not the only one, nor the first, to encounter this...

 Thanks!
 Nimrod



>>>