Re: Spark Docker image with added packages
Hi,

Thanks all for the replies.

I am adding the Spark dev list as well, as I think this might be an issue that needs to be addressed.

The options presented here will get the jars, but they don't help us with dependency conflicts...
For example, com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0 uses Guava 30 while Spark 3.5.3 uses Guava 14 - the options here will leave the two in conflict.

How can one add packages to their Spark image (during the build of the Docker image) without causing unresolved conflicts?

Thanks!
Nimrod

On Tue, Oct 15, 2024 at 6:53 PM Damien Hawes wrote:

> Herewith a more fleshed out example:
>
> An example of a *build.gradle.kts* file:
>
> plugins {
>     id("java")
> }
>
> val sparkJarsDir =
>     objects.directoryProperty().convention(layout.buildDirectory.dir("sparkJars"))
>
> repositories {
>     mavenCentral()
> }
>
> val sparkJars: Configuration by configurations.creating {
>     isCanBeResolved = true
>     isCanBeConsumed = false
> }
>
> dependencies {
>     sparkJars("com.fasterxml.jackson.core:jackson-databind:2.18.0")
> }
>
> val copySparkJars by tasks.registering(Copy::class) {
>     group = "build"
>     description = "Copies the appropriate jars to the configured spark jars directory"
>     from(sparkJars)
>     into(sparkJarsDir)
> }
>
> Now, the *Dockerfile*:
>
> FROM spark:3.5.3-scala2.12-java17-ubuntu
>
> USER root
>
> COPY --chown=spark:spark build/sparkJars/* "$SPARK_HOME/jars/"
>
> USER spark
>
> Kind regards,
>
> Damien
>
> On Tue, Oct 15, 2024 at 4:19 PM Damien Hawes wrote:
>
>> The simplest solution that I have found is to use Gradle (or Maven, if you prefer), and list the dependencies that I want copied to $SPARK_HOME/jars as project dependencies.
>>
>> Summary of steps to follow:
>>
>> 1. Using your favourite build tool, declare a dependency on your required packages.
>> 2. Write your Dockerfile, with or without the Spark binaries inside it.
>> 3. Use your build tool to copy the dependencies to a location that the Docker daemon can access.
>> 4. Copy the dependencies into the correct directory.
>> 5. Ensure those files have the correct permissions.
>>
>> In my opinion, it is pretty easy to do this with Gradle.
>>
>> On Tue, Oct 15, 2024 at 15:28, Nimrod Ofek wrote:
>>
>>> Hi all,
>>>
>>> I am creating a base Spark image that we are using internally.
>>> We need to add some packages to the base image:
>>> spark:3.5.1-scala2.12-java17-python3-r-ubuntu
>>>
>>> Of course I do not want to start Spark with --packages "..." - that is not efficient at all - I would like to add the needed jars to the image.
>>>
>>> Ideally, I would add to my image something that adds the needed packages, something like:
>>>
>>> RUN $SPARK_HOME/bin/add-packages "..."
>>>
>>> But AFAIK there is no such option.
>>>
>>> Other than running Spark to add those packages and then creating the image - or always running Spark with --packages "..." - what can I do?
>>> Is there a way to run just the code that the --packages option runs, without running Spark, so I can add the needed dependencies to my image?
>>>
>>> I am sure I am not the only one, nor the first, to encounter this...
>>>
>>> Thanks!
>>> Nimrod
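For the specific Guava clash above, one possible approach - a sketch only, not something prescribed in this thread - is to relocate the connector's copy of Guava before the jar ever reaches $SPARK_HOME/jars, for example with Gradle's Shadow plugin. The plugin version and the relocated.* prefix below are illustrative assumptions; if the connector already publishes a shaded artifact, depending on that instead may be simpler.

plugins {
    id("java")
    // Shadow plugin, used here only to build a jar with relocated (shaded) packages.
    id("com.github.johnrengelman.shadow") version "8.1.1"
}

repositories {
    mavenCentral()
}

dependencies {
    // The connector whose Guava 30 would otherwise sit next to Spark's Guava 14.
    implementation("com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0")
}

tasks.shadowJar {
    archiveClassifier.set("relocated")
    // Move Guava's packages so the copy bundled in this jar cannot collide
    // with the Guava already present inside the Spark image.
    relocate("com.google.common", "relocated.com.google.common")
}

The jar produced under build/libs can then be copied into $SPARK_HOME/jars (for instance via a copySparkJars-style task like the one quoted above) in place of the plain connector jar. In practice you would likely also want to exclude transitive artifacts, such as Hadoop, that the Spark image already provides.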
Re: [DISCUSS] Migrate or deprecate the Spark Kinesis connector
Hi Jungtaek,

Thanks for the information. Given that, are we good to close this Jira: https://issues.apache.org/jira/browse/SPARK-45720 ?

-Junyu

On Wed, Oct 16, 2024 at 8:36 PM Jungtaek Lim wrote:

> DStream is deprecated in Spark 3.4.0, hence the Kinesis connector for DStream inherits the same fate. We just didn't make every DStream class produce warning messages; we made the entry class produce them and thought that was sufficient.
>
> On Mon, Oct 14, 2024 at 5:03 PM Johnson Chen wrote:
>
>> Hi Spark community,
>>
>> A couple of months ago, I raised a PR to upgrade the AWS SDK to v2 for the Spark Kinesis connector: https://github.com/apache/spark/pull/44211.
>> Given that the 4.0 feature freeze is coming, I am following up to check whether we still want this change in the upcoming 4.0 release. If yes, I could revise and rebase the PR accordingly.
>>
>> Here is the tracking Jira:
>> https://issues.apache.org/jira/browse/SPARK-45720
>>
>> --
>> Thanks,
>> Junyu

--
Thanks,
Junyu Chen
Re: Spark Docker image with added packages
Hi,

That's on you, as the maintainer of the derived image, to ensure that your added dependencies do not conflict with Spark's dependencies. Speaking from experience, there are several ways to achieve this:

1. Ensure you're using packages that ship shaded and relocated dependencies, if possible.
2. If you're creating packages of your own, ensure your dependencies (and their transitive dependencies) are compatible with the ones that Spark uses. Otherwise, create your own shaded packages if your packages require different versions.

Both build systems can aid you in this, and both can also give you a dependency report.

Maven has the enforcer plugin with the dependency convergence rule to prevent differing versions. This does require you to declare the Spark dependencies in the pom (with *provided* scope), though the dependency convergence rule doesn't check the *provided* scope by default. Once you include the *provided* scope, you'll start encountering a lot of conflicts among the transitive dependencies that Spark uses. You'll have to work through each of these to get the build to pass.

For example, the relevant parts of such a pom:

<properties>
    <spark.version>3.4.3</spark.version>
    <scala.compat.version>2.12</scala.compat.version>
    <java.version>11</java.version>
    <scala.version>${scala.compat.version}.${scala.minor.version}</scala.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.36</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.compat.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-enforcer-plugin</artifactId>
            <version>3.5.0</version>
            <executions>
                <execution>
                    <id>enforce</id>
                    <phase>test</phase>
                    <goals>
                        <goal>enforce</goal>
                    </goals>
                    <configuration>
                        <rules>
                            <dependencyConvergence/>
                        </rules>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <version>3.8.0</version>
            <executions>
                <execution>
                    <id>copy-dependencies</id>
                    <phase>package</phase>
                    <goals>
                        <goal>copy-dependencies</goal>
                    </goals>
                    <configuration>
                        <outputDirectory>${project.build.directory}/spark-jars</outputDirectory>
                        <overWriteReleases>false</overWriteReleases>
                        <overWriteSnapshots>false</overWriteSnapshots>
                        <overWriteIfNewer>true</overWriteIfNewer>
                        <includeScope>runtime</includeScope>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

On Thu, Oct 17, 2024 at 13:51, Nimrod Ofek wrote:

> Hi,
>
> Thanks all for the replies.
>
> I am adding the Spark dev list as well, as I think this might be an issue that needs to be addressed.
>
> The options presented here will get the jars, but they don't help us with dependency conflicts...
> For example, com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0 uses Guava 30 while Spark 3.5.3 uses Guava 14 - the options here will leave the two in conflict.
>
> How can one add packages to their Spark image (during the build of the Docker image) without causing unresolved conflicts?
>
> Thanks!
> Nimrod
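For readers on Gradle rather than Maven, a rough analogue of the enforcer's dependency convergence rule - again only a sketch, not taken from this thread - is the failOnVersionConflict resolution strategy, and Gradle's built-in dependency reports cover the reporting side mentioned above:

// build.gradle.kts - fail resolution whenever two modules pull in different
// versions of the same dependency, roughly what Maven's dependencyConvergence rule does.
configurations.all {
    resolutionStrategy {
        failOnVersionConflict()
    }
}

// Dependency reports, e.g. to see where the Guava versions come from:
//   ./gradlew dependencyInsight --dependency com.google.guava:guava --configuration runtimeClasspath
//   ./gradlew dependencies --configuration runtimeClasspath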
Re: Spark Docker image with added packages
Creating a custom classloader to load classes from those jars?

On Thu, Oct 17, 2024 at 19:47, Nimrod Ofek wrote:

> Hi,
>
> Thanks all for the replies.
>
> I am adding the Spark dev list as well, as I think this might be an issue that needs to be addressed.
>
> The options presented here will get the jars, but they don't help us with dependency conflicts...
> For example, com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0 uses Guava 30 while Spark 3.5.3 uses Guava 14 - the options here will leave the two in conflict.
>
> How can one add packages to their Spark image (during the build of the Docker image) without causing unresolved conflicts?
>
> Thanks!
> Nimrod
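For completeness, a minimal sketch of what the custom-classloader idea could look like; the class name is made up and this is not how the stock Spark images wire things up. Spark itself exposes the same child-first idea for user-added jars through the experimental spark.driver.userClassPathFirst / spark.executor.userClassPathFirst settings, which are usually the easier route to try first.

import java.net.URL
import java.net.URLClassLoader

// Child-first classloader sketch: look in the supplied jars before delegating
// to the parent, so e.g. a bundled Guava 30 wins over the Guava on the parent
// classpath. Hypothetical example for illustration only.
class ChildFirstClassLoader(urls: Array<URL>, parent: ClassLoader) :
    URLClassLoader(urls, parent) {

    override fun loadClass(name: String, resolve: Boolean): Class<*> =
        synchronized(getClassLoadingLock(name)) {
            // 1. Already defined by this loader?
            val clazz = findLoadedClass(name)
                // 2. Otherwise try our own jars first ...
                ?: runCatching { findClass(name) }.getOrNull()
                // 3. ... and only then fall back to normal parent delegation.
                ?: super.loadClass(name, resolve)
            if (resolve) resolveClass(clazz)
            clazz
        }
}

// Usage sketch (path is hypothetical):
//   ChildFirstClassLoader(arrayOf(java.io.File("/opt/extra-jars/gcs-connector.jar").toURI().toURL()),
//                         Thread.currentThread().contextClassLoader)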