Re: Spark Docker image with added packages

2024-10-17 Thread Damien Hawes
> …ns here will result with both conflicting.
>
> How can one add packages to their Spark (during the build process of the Docker image) - without causing unresolved conflicts?
>
> Thanks!
> Nimrod
>
> On Tue, Oct 15, 2024 at 6:53 PM Damien Hawes wrote:

Re: Spark Docker image with added packages

2024-10-15 Thread Damien Hawes
…appropriate jars to the configured spark jars directory"
    from(sparkJars)
    into(sparkJarsDir)
}

Now, the *Dockerfile*:

FROM spark:3.5.3-scala2.12-java17-ubuntu
USER root
COPY --chown=spark:spark build/sparkJars/* "$SPARK_HOME/jars/"
USER spark

Kind regards,
Damien

On Tue, …
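(Only the tail of the Gradle snippet survives in this preview. Below is a minimal build.gradle.kts sketch of a task with that shape; the sparkJars configuration, the sparkJarsDir location, and the task name are reconstructions for illustration, not necessarily the original build script.)

// build.gradle.kts (sketch only)
// A dedicated configuration holding the jars destined for the image.
val sparkJars by configurations.creating

// Matches the build/sparkJars/* path that the Dockerfile copies from.
val sparkJarsDir = layout.buildDirectory.dir("sparkJars")

tasks.register<Copy>("copySparkJars") {
    description = "Copies the appropriate jars to the configured spark jars directory"
    from(sparkJars)
    into(sparkJarsDir)
}

With the jars staged under build/sparkJars, the Dockerfile above can COPY them into $SPARK_HOME/jars at image build time, so containers no longer need --packages at runtime.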

Re: Spark Docker image with added packages

2024-10-15 Thread Damien Hawes
The simplest solution that I have found to this was to use Gradle (or Maven, if you prefer), and list the dependencies that I want copied to $SPARK_HOME/jars as project dependencies.

Summary of steps to follow:

1. Using your favourite build tool, declare a dependency on your required packages …
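(Step 1 is cut off in this preview. A minimal Gradle Kotlin DSL sketch of what declaring such packages could look like follows; the configuration name and the example coordinate are assumptions for illustration, not taken from the original message.)

// build.gradle.kts (sketch of step 1 only)
// A dedicated configuration that holds just the jars destined for the image.
val sparkJars by configurations.creating

repositories {
    // Needed so the declared packages can actually be resolved.
    mavenCentral()
}

dependencies {
    // Example coordinate only: anything you would otherwise pass via --packages.
    sparkJars("org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.3")
}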

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Damien Hawes
Right now, with the structure of your data, it isn't possible. The rows aren't duplicates of each other. "a" and "b" both exist in the array. So Spark is correctly performing the join. It looks like you need to find another way to model this data to get what you want to achieve. Are the values of …
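(The DataFrames in question don't survive in this preview, but the point can be illustrated with a small, made-up PySpark example; every name and value below is hypothetical.)

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("array-join-example").getOrCreate()

# One side carries an array-typed key ...
left = spark.createDataFrame([(1, ["a", "b"])], ["id", "tags"])
# ... the other side a plain string key.
right = spark.createDataFrame([("a", 100), ("b", 200)], ["tag", "value"])

# Joining on "the array contains the key" matches the single left row twice:
# once for "a" and once for "b". The two result rows differ in tag and value,
# so they are not duplicates of each other; Spark is doing exactly what was asked.
joined = left.join(right, F.expr("array_contains(tags, tag)"))
joined.show()

# To get one genuinely distinct row per array element instead, one option is to
# explode the array before joining:
exploded = left.select("id", F.explode("tags").alias("tag"))
exploded.join(right, on="tag", how="inner").show()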

[SparkListener] Accessing classes loaded via the '--packages' option

2024-04-26 Thread Damien Hawes
Hi folks, I'm contributing to the OpenLineage project, specifically the Apache Spark integration. My current focus is on extending the project to support data lineage extraction for Spark Streaming, beginning with Apache Kafka sources and sinks. I've encountered an obstacle when attempting to access …