Can you share the logs, settings, environment, etc., and file a JIRA? There are integration test cases for K8S support, and I have also tested this myself before. It would be helpful if you try what I did at https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html and see if it works.
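Roughly, the approach from that post looks like this (the environment name, package list and app.py below are placeholders, not taken from your setup):

  # build and pack a conda environment on the client machine
  conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
  conda activate pyspark_conda_env
  conda pack -f -o pyspark_conda_env.tar.gz

  # ship it with the job; Spark unpacks it into ./environment on the driver and executors
  export PYSPARK_DRIVER_PYTHON=python   # do not set this in cluster mode
  export PYSPARK_PYTHON=./environment/bin/python
  spark-submit --archives pyspark_conda_env.tar.gz#environment app.py

If that works from spark-submit but not through the operator, that difference itself would be useful detail for the JIRA.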
On Mon, 6 Dec 2021 at 17:22, Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de> wrote:

> Hi Mich,
>
> Thanks for your response. Yes, the --py-files option works; I also tested it.
> The question is why the --archives option doesn't.
>
> From Jira I can see that it should be available since 3.1.0:
>
> https://issues.apache.org/jira/browse/SPARK-33530
> https://issues.apache.org/jira/browse/SPARK-33615
>
> Best,
> Meikel
>
> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
> *Sent:* Saturday, 4 December 2021 18:36
> *To:* Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de>
> *Cc:* dev <d...@spark.apache.org>; user@spark.apache.org
> *Subject:* Re: Conda Python Env in K8S
>
> Hi Meikel,
>
> In the past I tried with
>
> --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
> --archives hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.zip#pyspark_venv \
>
> which is basically what you are doing. The first line (--py-files) works, but the second one fails.
>
> It tries to unpack the archive:
>
> Unpacking an archive hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.zip#pyspark_venv
> from /tmp/spark-502a5b57-0fe6-45bd-867d-9738e678e9a3/pyspark_venv.zip
> to /opt/spark/work-dir/./pyspark_venv
>
> but it fails.
>
> This could be due to the virtual environment being created inside the docker image in the work-dir, or to there not being enough memory available to gunzip and untar the file, especially if your executors run on cluster nodes with less memory than the driver node.
>
> However, the most convenient way to add extra packages is to add them directly to the docker image at the time the image is created. External packages are then bundled as part of my docker image: the image is fixed, and if an application needs that set of dependencies every time, they are already there. Also note that every RUN statement creates an intermediate container and therefore increases the build time, so install all packages in a single line:
>
> RUN pip install pyyaml numpy cx_Oracle --no-cache-dir
>
> The --no-cache-dir option to pip prevents the downloaded binaries from being added to the image, reducing the image size.
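> A minimal Dockerfile along those lines might look like this; the base image tag here is only an example, substitute whatever Spark image you actually build from:
>
>   # example base image only - use your own Spark/PySpark image
>   FROM apache/spark-py:v3.1.3
>
>   USER root
>   # one RUN layer for all packages; --no-cache-dir keeps the pip cache out of the image
>   RUN pip install --no-cache-dir pyyaml numpy cx_Oracle
>   # switch back to the non-root spark user (UID 185 in the official images)
>   USER 185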
> Log in to the docker image and check the Python packages installed:
>
> docker run -u 0 -it spark/spark-py:3.1.1-scala_2.12-8-jre-slim-buster_java8PlusPackages bash
>
> root@5bc049af7278:/opt/spark/work-dir# pip list
> Package    Version
> ---------- -------
> cx-Oracle  8.3.0
> numpy      1.21.4
> pip        21.3.1
> PyYAML     6.0
> setuptools 59.4.0
> wheel      0.34.2
>
> HTH
>
> view my Linkedin profile: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On Sat, 4 Dec 2021 at 07:52, Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de> wrote:
>
> Hi Mich,
>
> Sure, that's possible, but distributing the complete env would be more practical.
> Our current workaround is to build the different environments, store them on a PV, mount that into the pods, and point the SparkApplication resource at the desired env.
>
> But these options do exist, and I want to understand what the issue is…
> Any hints on that?
>
> Best,
> Meikel
>
> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
> *Sent:* Friday, 3 December 2021 13:27
> *To:* Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de>
> *Cc:* dev <d...@spark.apache.org>; user@spark.apache.org
> *Subject:* Re: Conda Python Env in K8S
>
> Build the Python packages into the docker image itself first with pip install:
>
> RUN pip install pandas . . --no-cache-dir
>
> HTH
>
> On Fri, 3 Dec 2021 at 11:58, Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de> wrote:
>
> Hello,
>
> I am trying to run Spark jobs using the Spark Kubernetes Operator.
> But when I try to bundle a conda Python environment using the following resource description, the Python interpreter is only unpacked on the driver and not on the executors.
>
> apiVersion: "sparkoperator.k8s.io/v1beta2"
> kind: SparkApplication
> metadata:
>   name: …
> spec:
>   type: Python
>   pythonVersion: "3"
>   mode: cluster
>   mainApplicationFile: local:///path/script.py
>   ..
>   sparkConf:
>     "spark.archives": "local:///path/conda-env.tar.gz#environment"
>     "spark.pyspark.python": "./environment/bin/python"
>     "spark.pyspark.driver.python": "./environment/bin/python"
>
> The driver unpacks the archive and the Python script gets executed.
> On the executors there is no log message indicating that the archive gets unpacked.
> Executors then fail as they can't find the Python executable at the given location "./environment/bin/python".
>
> Any hint?
>
> Best,
> Meikel
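PS: to narrow this down, it may help to check whether the executor pods ever receive the archive at all. A rough way to do that (namespace and pod name below are placeholders):

  # list the executor pods of the application
  kubectl get pods -n <namespace> -l spark-role=executor

  # look in an executor log for the same "Unpacking an archive" message you see on the driver
  kubectl logs -n <namespace> <executor-pod-name> | grep -i "unpacking an archive"

If that message never appears on the executors, the archive is not being distributed to them at all, which would be useful detail to put in the JIRA.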