Can you share the logs, settings, environment, etc., and file a JIRA? There are integration test cases for K8S support, and I have also tested this myself before. It would be helpful if you try what I did at https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html and see if it works.
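Roughly, the approach from that post looks like this (the environment name, package list and app.py below are placeholders, not taken from your setup):

  # build and pack a conda environment on the client machine
  conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
  conda activate pyspark_conda_env
  conda pack -f -o pyspark_conda_env.tar.gz

  # ship it with the job; Spark unpacks it into ./environment on the driver and executors
  export PYSPARK_DRIVER_PYTHON=python   # do not set this in cluster mode
  export PYSPARK_PYTHON=./environment/bin/python
  spark-submit --archives pyspark_conda_env.tar.gz#environment app.py

If that works from spark-submit but not through the operator, that difference itself would be useful detail for the JIRA.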
On Mon, 6 Dec 2021 at 17:22, Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de> wrote:

> Hi Mich,
>
> Thanks for your response. Yes, the --py-files option works; I also tested it.
> The question is why the --archives option doesn't.
>
> From Jira I can see that it should be available since 3.1.0:
>
> https://issues.apache.org/jira/browse/SPARK-33530
> https://issues.apache.org/jira/browse/SPARK-33615
>
> Best,
> Meikel
>
> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
> *Sent:* Saturday, 4 December 2021 18:36
> *To:* Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de>
> *Cc:* dev <d...@spark.apache.org>; user@spark.apache.org
> *Subject:* Re: Conda Python Env in K8S
>
> Hi Meikel,
>
> In the past I tried with
>
> --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
> --archives hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.zip#pyspark_venv \
>
> which is basically what you are doing. The first line (--py-files) works, but the second one fails.
>
> It tries to unpack the archive:
>
> Unpacking an archive hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.zip#pyspark_venv
> from /tmp/spark-502a5b57-0fe6-45bd-867d-9738e678e9a3/pyspark_venv.zip
> to /opt/spark/work-dir/./pyspark_venv
>
> but it fails.
>
> This could be due to the virtual environment being created inside the docker image in the work-dir, or to there not being enough memory available to gunzip and untar the file, especially if your executors run on cluster nodes with less memory than the driver node.
>
> However, the most convenient way to add extra packages is to add them directly to the docker image at the time the image is created. External packages are then bundled as part of my docker image: the image is fixed, and if an application needs that set of dependencies every time, they are already there. Also note that every RUN statement creates an intermediate container and therefore increases the build time, so install all packages in a single line:
>
> RUN pip install pyyaml numpy cx_Oracle --no-cache-dir
>
> The --no-cache-dir option to pip prevents the downloaded binaries from being added to the image, reducing the image size.
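> A minimal Dockerfile along those lines might look like this; the base image tag here is only an example, substitute whatever Spark image you actually build from:
>
>   # example base image only - use your own Spark/PySpark image
>   FROM apache/spark-py:v3.1.3
>
>   USER root
>   # one RUN layer for all packages; --no-cache-dir keeps the pip cache out of the image
>   RUN pip install --no-cache-dir pyyaml numpy cx_Oracle
>   # switch back to the non-root spark user (UID 185 in the official images)
>   USER 185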
> Log in to the docker image and check the Python packages installed:
>
> docker run -u 0 -it spark/spark-py:3.1.1-scala_2.12-8-jre-slim-buster_java8PlusPackages bash
>
> root@5bc049af7278:/opt/spark/work-dir# pip list
> Package    Version
> ---------- -------
> cx-Oracle  8.3.0
> numpy      1.21.4
> pip        21.3.1
> PyYAML     6.0
> setuptools 59.4.0
> wheel      0.34.2
>
> HTH
>
> view my Linkedin profile: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On Sat, 4 Dec 2021 at 07:52, Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de> wrote:
>
> Hi Mich,
>
> Sure, that's possible, but distributing the complete env would be more practical.
> Our current workaround is to build the different environments, store them on a PV, mount that into the pods, and point the SparkApplication resource at the desired env.
>
> But these options do exist, and I want to understand what the issue is…
> Any hints on that?
>
> Best,
> Meikel
>
> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
> *Sent:* Friday, 3 December 2021 13:27
> *To:* Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de>
> *Cc:* dev <d...@spark.apache.org>; user@spark.apache.org
> *Subject:* Re: Conda Python Env in K8S
>
> Build the Python packages into the docker image itself first with pip install:
>
> RUN pip install pandas . . --no-cache-dir
>
> HTH
>
> On Fri, 3 Dec 2021 at 11:58, Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de> wrote:
>
> Hello,
>
> I am trying to run Spark jobs using the Spark Kubernetes Operator.
> But when I try to bundle a conda Python environment using the following resource description, the Python interpreter is only unpacked on the driver and not on the executors.
>
> apiVersion: "sparkoperator.k8s.io/v1beta2"
> kind: SparkApplication
> metadata:
>   name: …
> spec:
>   type: Python
>   pythonVersion: "3"
>   mode: cluster
>   mainApplicationFile: local:///path/script.py
>   ..
>   sparkConf:
>     "spark.archives": "local:///path/conda-env.tar.gz#environment"
>     "spark.pyspark.python": "./environment/bin/python"
>     "spark.pyspark.driver.python": "./environment/bin/python"
>
> The driver unpacks the archive and the Python script gets executed.
> On the executors there is no log message indicating that the archive gets unpacked.
> Executors then fail as they can't find the Python executable at the given location "./environment/bin/python".
>
> Any hint?
>
> Best,
> Meikel
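PS: to narrow this down, it may help to check whether the executor pods ever receive the archive at all. A rough way to do that (namespace and pod name below are placeholders):

  # list the executor pods of the application
  kubectl get pods -n <namespace> -l spark-role=executor

  # look in an executor log for the same "Unpacking an archive" message you see on the driver
  kubectl logs -n <namespace> <executor-pod-name> | grep -i "unpacking an archive"

If that message never appears on the executors, the archive is not being distributed to them at all, which would be useful detail to put in the JIRA.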