Hi Nimrod,
This is a method I used back in August 2023 (attached) to build the
Dockerfile. It is a year old, but I think it is still valid. In my
approach, using multi-stage builds for Python dependencies is a good way
to keep the Docker image lean. For Spark JARs you can use a similar
strategy to keep the final image as small as possible: install the
dependencies in a temporary build stage and copy only the necessary files
into the final image. In short, the principles of bundling dependencies
into the Docker image at build time, avoiding runtime dependency
installation, and leveraging CI/CD pipelines apply equally well to
managing Spark JARs.
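As a rough sketch of what I mean, you can resolve the JARs in a throwaway
build stage and copy only the artifacts into the final image. The Maven
coordinate, version and curl image below are just examples, so substitute
whatever you would otherwise pass to --packages, and remember to fetch
the transitive dependencies as well (or resolve them with Maven/Coursier
instead of plain curl):

# Stage 1: fetch the JARs you would otherwise pass to --packages
FROM curlimages/curl AS jars
RUN mkdir -p /tmp/spark-jars && \
    curl -fSL -o /tmp/spark-jars/spark-sql-kafka-0-10_2.12-3.5.1.jar \
    https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.5.1/spark-sql-kafka-0-10_2.12-3.5.1.jar

# Stage 2: the Spark image itself, with the JARs baked in at build time
# (SPARK_HOME is /opt/spark in the official image)
FROM spark:3.5.1-scala2.12-java17-python3-r-ubuntu
COPY --from=jars /tmp/spark-jars/ /opt/spark/jars/

With the JARs sitting in $SPARK_HOME/jars there is nothing left to resolve
or download when the job starts, so --packages is not needed at runtime.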
HTH
Mich Talebzadeh,
Architect | Data Engineer | Data Science | Financial Crime
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh
*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand expert
opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).
On Tue, 15 Oct 2024 at 19:45, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
> Hi all,
>
> I am creating a base Spark image that we are using internally.
> We need to add some packages to the base image:
> spark:3.5.1-scala2.12-java17-python3-r-ubuntu
>
> Of course I do not want to start Spark with --packages "..." - as it is
> not efficient at all - I would like to add the needed jars to the image.
>
> Ideally, I would add to my image something that will add the needed
> packages - something like:
>
> RUN $SPARK_HOME/bin/add-packages "..."
>
> But AFAIK there is no such option.
>
> Other than running Spark to add those packages and then creating the image
> - or running Spark always with --packages "..." - what can I do?
> Is there a way to run just the code that is run by the --packages command -
> without running Spark, so I can add the needed dependencies to my image?
>
> I am sure I am not the only one, nor the first, to encounter this...
>
> Thanks!
> Nimrod
>
>
>
--- Begin Message ---
Hi,
This is a bit old hat, but it is worth getting opinions on it.
Current options that I believe apply are:
1. Installing the Python packages individually via pip in the Docker
build process
2. Installing them together via pip and a requirements.txt in the build
process
3. Installing them to a volume and adding the volume to the PYTHONPATH
From my experience there is a case for installing them during the Docker
build process:
RUN pip install pyyaml --no-cache-dir
RUN pip install --no-cache-dir -r requirements.txt
or using the following in spark-submit
--archives pyspark_venv.tar.gz#environment
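For completeness, the archive in that case is built beforehand with
something like venv-pack; the package list and application name here are
placeholders:

# on the build/client machine: create, populate and pack a virtualenv
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install --no-cache-dir pyyaml venv-pack
venv-pack -o pyspark_venv.tar.gz

# ship it with the job and point the executors at its Python
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_venv.tar.gz#environment your_app.py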
The problem with the archives route, as I have noticed, is that unzipping
and untarring the packages takes considerable time and sometimes
spark-submit hangs! With packages built into the Docker image, the package
versions may get out of date, although this has not been an issue for me.
So there are pros and cons either way. However, with a CI/CD pipeline we
can rebuild the Docker images more frequently if needed.
The drawback of the Docker image approach is that the more packages you
add, the larger the image, and pulling it from the container registry
(ECR, GCR etc.) takes more time and impacts deployment time. I still
favour options 1 or 2 above.
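A multi-stage build is one way to mitigate the image size for options 1
or 2. A rough sketch follows; the base image, paths and requirements file
are examples, and building the dependencies against the same base as the
final stage keeps the Python versions consistent while leaving pip caches
and build artefacts out of the final image:

# builder stage: install the Python packages into a separate directory
FROM spark:3.5.1-scala2.12-java17-python3-r-ubuntu AS builder
USER root
COPY requirements.txt .
RUN pip3 install --no-cache-dir --target=/opt/py-deps -r requirements.txt

# final stage: copy across only the installed packages
FROM spark:3.5.1-scala2.12-java17-python3-r-ubuntu
COPY --from=builder /opt/py-deps /opt/py-deps
# prepend to any existing PYTHONPATH if your base image sets one
ENV PYTHONPATH=/opt/py-deps

This assumes pip3 is present in the python3 variant of the base image.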
Thanks
Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
--- End Message ---
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org