Re: Unpacking and using external modules with PySpark inside k8s

Mich Talebzadeh Wed, 21 Jul 2021 15:36:38 -0700

I managed to sort this one out.

Please see


https://stackoverflow.com/questions/68461865/unpacking-and-using-external-modules-with-pyspark-inside-kubernetes/68476548#68476548

HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 21 Jul 2021 at 18:10, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

>
> Hi,
>
> I am aware that some fellow members of this dev group were involved in
> creating the scripts for running Spark on Kubernetes.
>
> # To build additional PySpark docker image
> $ ./bin/docker-image-tool.sh -r <repo> -t my-tag -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
>
>
> The problem I described is how to unpack and use packages such as yaml
> and pandas inside k8s.
>
>
> I am using
>
>
>         spark-submit --verbose \
>            --master k8s://$K8S_SERVER \
>            --archives=hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.tar.gz \
>            --deploy-mode cluster \
>            --name pytest \
>            --conf spark.kubernetes.namespace=spark \
>            --conf spark.executor.instances=1 \
>            --conf spark.kubernetes.driver.limit.cores=1 \
>            --conf spark.executor.cores=1 \
>            --conf spark.executor.memory=500m \
>            --conf spark.kubernetes.container.image=${IMAGE} \
>            --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount \
>            --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
>            hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/${APPLICATION}
>
>
> The directory containing the code is zipped as DSBQ.zip, and it is read OK.
>
>
> However, in verbose mode it reports:
>
>
> 2021-07-21 17:01:29,038 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Unpacking an archive hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz from /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/pyspark_venv.tar.gz to /opt/spark/work-dir/./pyspark_venv.tar.gz
>
>
> In this case it tries to use pandas
>
>
> The module ${APPLICATION} has this code
>
>
> import sys
> import os
> import pkg_resources
>
> def main():
>     # Show where this interpreter looks for modules.
>     print("\n printing sys.path")
>     for p in sys.path:
>         print(p)
>     # PYTHONPATH may be unset, so fall back to an empty string.
>     user_paths = os.environ.get('PYTHONPATH', '').split(os.pathsep)
>     print("\n Printing user_paths")
>     for p in user_paths:
>         print(p)
>     v = sys.version
>     print("\n python version")
>     print(v)
>     # List every distribution visible to this interpreter.
>     print("\nlooping over pkg_resources.working_set")
>     for r in pkg_resources.working_set:
>         print(r)
>     # Raises ModuleNotFoundError if pandas is not importable.
>     import pandas
>
> if __name__ == "__main__":
>     main()
>
>
> The output is shown below
>
> Unpacking an archive hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz from /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/pyspark_venv.tar.gz to /opt/spark/work-dir/./pyspark_venv.tar.gz
>
>  printing sys.path
> /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538
> /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/DSBQ.zip
> /opt/spark/python/lib/pyspark.zip
> /opt/spark/python/lib/py4j-0.10.9-src.zip
> /opt/spark/jars/spark-core_2.12-3.1.1.jar
> /usr/lib/python37.zip
> /usr/lib/python3.7
> /usr/lib/python3.7/lib-dynload
> /usr/local/lib/python3.7/dist-packages
> /usr/lib/python3/dist-packages
>
>  Printing user_paths
> /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/DSBQ.zip
> /opt/spark/python/lib/pyspark.zip
> /opt/spark/python/lib/py4j-0.10.9-src.zip
> /opt/spark/jars/spark-core_2.12-3.1.1.jar
>
>  python version
> 3.7.3 (default, Jan 22 2021, 20:04:44)
> [GCC 8.3.0]
>
> looping over pkg_resources.working_set
> setuptools 57.2.0
> pip 21.1.3
> wheel 0.32.3
> six 1.12.0
> SecretStorage 2.3.1
> pyxdg 0.25
> PyGObject 3.30.4
> pycrypto 2.6.1
> keyrings.alt 3.1.1
> keyring 17.1.1
> entrypoints 0.3
> cryptography 2.6.1
> asn1crypto 0.24.0
> Traceback (most recent call last):
>   File "/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/testpackages.py", line 24, in <module>
>     main()
>   File "/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/testpackages.py", line 21, in main
>     import pandas
> ModuleNotFoundError: No module named 'pandas'
>
>
> In addition, if I go inside the Docker container and run
>
>
> 185@4a6747d59ff2:/opt/spark/work-dir$ pip3 list
> Package       Version
> ------------- -------
> asn1crypto    0.24.0
> cryptography  2.6.1
> entrypoints   0.3
> keyring       17.1.1
> keyrings.alt  3.1.1
> pip           21.1.3
> pycrypto      2.6.1
> PyGObject     3.30.4
> pyxdg         0.25
> SecretStorage 2.3.1
> setuptools    57.2.0
> six           1.12.0
> wheel         0.32.3
>
>
> I don't get any external packages!
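
A minimal sketch, not part of the original script: rather than letting `import pandas` abort the job, the standard-library `importlib.util.find_spec` can report whether each module is importable and from which file. `probe_modules` is a hypothetical helper name, and the module list is simply the packages mentioned in this thread.

```python
import importlib.util
import sys


def probe_modules(names):
    """Return {module name: file it would load from, or None if not importable}."""
    found = {}
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Treat lookup errors the same as "not found".
            spec = None
        found[name] = spec.origin if spec is not None else None
    return found


if __name__ == "__main__":
    # Which interpreter is actually running the driver code?
    print("interpreter:", sys.executable)
    # Hypothetical module list: the packages discussed in this thread.
    for name, origin in probe_modules(["yaml", "pandas", "numpy"]).items():
        print(f"{name:10s} -> {origin or 'NOT FOUND'}")
```

Printing `sys.executable` alongside the result shows which Python the driver is really using, which is worth comparing with the interpreter shipped inside the archive.
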
>
>
> I opened an SO thread for this as well.
>
>
>
> https://stackoverflow.com/questions/68461865/unpacking-and-using-external-modules-with-pyspark-inside-kubernetes
>
>
> Do I need to hack the Dockerfile to install the packages listed in requirements.txt, etc.?
>
>
> Thanks
>
>
> ---------- Forwarded message ---------
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Tue, 20 Jul 2021 at 22:51
> Subject: Unpacking and using external modules with PySpark inside k8s
> To: user @spark <u...@spark.apache.org>
>
>
>
> I have been struggling with this.
>
>
> Kubernetes (minikube, not that it matters) is working fine. In one of the
> modules, called configure.py, I am importing the yaml module:
>
>
> import yaml
>
>
> This throws an error:
>
>
>     import yaml
> ModuleNotFoundError: No module named 'yaml'
>
>
> I have been through a number of loops.
>
>
> First I created a virtual environment archive, pyspark_venv.tar.gz, that
> includes the yaml module and passed it to spark-submit as follows:
>
>
> + spark-submit --verbose --master k8s://192.168.49.2:8443
> '--archives=hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv'
> --deploy-mode cluster --name pytest --conf
> 'spark.kubernetes.namespace=spark' --conf 'spark.executor.instances=1'
> --conf 'spark.kubernetes.driver.limit.cores=1' --conf
> 'spark.executor.cores=1' --conf 'spark.executor.memory=500m' --conf
> 'spark.kubernetes.container.image=pytest-repo/spark-py:3.1.1' --conf
> 'spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount'
> --py-files hdfs://50.140.197.220:9000/minikube/codes/DSBQ.zip
> hdfs://50.140.197.220:9000/minikube/codes/testyml.py
>
>
> Parsed arguments:
>   master                  k8s://192.168.49.2:8443
>   deployMode              cluster
>   executorMemory          500m
>   executorCores           1
>   totalExecutorCores      null
>   propertiesFile          /opt/spark/conf/spark-defaults.conf
>   driverMemory            null
>   driverCores             null
>   driverExtraClassPath    $SPARK_HOME/jars/*.jar
>   driverExtraLibraryPath  null
>   driverExtraJavaOptions  null
>   supervise               false
>   queue                   null
>   numExecutors            1
>   files                   null
>   pyFiles                 hdfs://50.140.197.220:9000/minikube/codes/DSBQ.zip
>   archives                hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv
>   mainClass               null
>   primaryResource         hdfs://50.140.197.220:9000/minikube/codes/testyml.py
>   name                    pytest
>   childArgs               []
>   jars                    null
>   packages                null
>   packagesExclusions      null
>   repositories            null
>   verbose                 true
>
>
> Unpacking an archive hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv from /tmp/spark-d339a76e-090c-4670-89aa-da723d6e9fbc/pyspark_venv.tar.gz to /opt/spark/work-dir/./pyspark_venv
>
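
The two "Unpacking an archive" log lines in this thread differ only in the target directory: with the `#pyspark_venv` suffix the archive lands in `./pyspark_venv`, and without it in `./pyspark_venv.tar.gz`. The sketch below only mirrors that observed naming using standard-library URL parsing. It is an assumption drawn from the logs, not Spark's implementation, and `unpack_dir_name` is a hypothetical helper.

```python
import posixpath
from urllib.parse import urlsplit


def unpack_dir_name(archive_uri):
    """Directory name implied by an --archives entry.

    Uses the '#alias' fragment when present, otherwise the archive file name.
    This mirrors the log output in this thread; it is not Spark's own code.
    """
    parts = urlsplit(archive_uri)
    return parts.fragment or posixpath.basename(parts.path)


if __name__ == "__main__":
    base = "hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz"
    print(unpack_dir_name(base))
    print(unpack_dir_name(base + "#pyspark_venv"))
```

Knowing the directory name matters because any path that points into the unpacked environment has to use it exactly.
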
>
> printing sys.path
> /tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc
> /tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/DSBQ.zip
> /opt/spark/python/lib/pyspark.zip
> /opt/spark/python/lib/py4j-0.10.9-src.zip
> /opt/spark/jars/spark-core_2.12-3.1.1.jar
> /usr/lib/python37.zip
> /usr/lib/python3.7
> /usr/lib/python3.7/lib-dynload
> /usr/local/lib/python3.7/dist-packages
> /usr/lib/python3/dist-packages
>
>  Printing user_paths
> ['/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/DSBQ.zip',
> '/opt/spark/python/lib/pyspark.zip',
> '/opt/spark/python/lib/py4j-0.10.9-src.zip',
> '/opt/spark/jars/spark-core_2.12-3.1.1.jar']
> checking yaml
> Traceback (most recent call last):
>   File "/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/testyml.py", line 18, in <module>
>     main()
>   File "/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/testyml.py", line 15, in main
>     import yaml
> ModuleNotFoundError: No module named 'yaml'
>
>
> Well it does not matter if it is yaml or numpy. It just cannot find the
> modules. How can I find out if the gz file is unpacked OK?
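
One way to answer that from inside the driver, sketched under assumptions: walk the directory named in the "Unpacking an archive" log line and list any site-packages directories beneath it, noting whether each one is on sys.path. The default path is taken from the log above; `find_site_packages` and `report` are hypothetical helper names.

```python
import os
import sys


def find_site_packages(root):
    """Return every 'site-packages' directory found beneath root, sorted."""
    hits = []
    for dirpath, dirnames, _ in os.walk(root):
        if "site-packages" in dirnames:
            hits.append(os.path.join(dirpath, "site-packages"))
    return sorted(hits)


def report(root):
    """Print what was found under root and return the list of hits."""
    if not os.path.isdir(root):
        print(f"{root}: not a directory, so nothing was unpacked here")
        return []
    hits = find_site_packages(root)
    if not hits:
        print(f"{root}: exists but contains no site-packages directory")
    for path in hits:
        print(f"{path} (on sys.path: {path in sys.path})")
    return hits


if __name__ == "__main__":
    # Assumption: target directory as printed in the log line above.
    report("/opt/spark/work-dir/pyspark_venv")
```

If a site-packages directory is listed but is not on sys.path, that would suggest the archive was unpacked correctly and the running interpreter is simply not the one inside it.
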
>
>
> Thanks
>
