I managed to sort this one out. Please see
https://stackoverflow.com/questions/68461865/unpacking-and-using-external-modules-with-pyspark-inside-kubernetes/68476548#68476548

HTH

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On Wed, 21 Jul 2021 at 18:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> I am aware that some fellow members in this dev group were involved in
> creating scripts for running spark on kubernetes
>
>   # To build additional PySpark docker image
>   $ ./bin/docker-image-tool.sh -r <repo> -t my-tag -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
>
> The problem I have explained is to be able to unpack packages like yaml
> and pandas inside k8s
>
> I am using
>
>   spark-submit --verbose \
>     --master k8s://$K8S_SERVER \
>     --archives=hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.tar.gz \
>     --deploy-mode cluster \
>     --name pytest \
>     --conf spark.kubernetes.namespace=spark \
>     --conf spark.executor.instances=1 \
>     --conf spark.kubernetes.driver.limit.cores=1 \
>     --conf spark.executor.cores=1 \
>     --conf spark.executor.memory=500m \
>     --conf spark.kubernetes.container.image=${IMAGE} \
>     --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount \
>     --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
>     hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/${APPLICATION}
>
> The directory containing code is zipped as DSBQ.zip and it reads it ok.
>
> However, it says in verbose mode
>
> 2021-07-21 17:01:29,038 WARN util.NativeCodeLoader: Unable to load
> native-hadoop library for your platform...
> using builtin-java classes where applicable
> Unpacking an archive
> hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz from
> /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/pyspark_venv.tar.gz to
> /opt/spark/work-dir/./pyspark_venv.tar.gz
>
> In this case it tries to use pandas
>
> The module ${APPLICATION} has this code
>
>   import sys
>   import os
>   import pkgutil
>   import pkg_resources
>
>
>   def main():
>       print("\n printing sys.path")
>       for p in sys.path:
>           print(p)
>       user_paths = os.environ['PYTHONPATH'].split(os.pathsep)
>       print("\n Printing user_paths")
>       for p in user_paths:
>           print(p)
>       v = sys.version
>       print("\n python version")
>       print(v)
>       print("\nlooping over pkg_resources.working_set")
>       for r in pkg_resources.working_set:
>           print(r)
>       import pandas
>
>   if __name__ == "__main__":
>       main()
>
> The output is shown below
>
> Unpacking an archive
> hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz from
> /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/pyspark_venv.tar.gz to
> /opt/spark/work-dir/./pyspark_venv.tar.gz
>
> printing sys.path
> /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538
> /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/DSBQ.zip
> /opt/spark/python/lib/pyspark.zip
> /opt/spark/python/lib/py4j-0.10.9-src.zip
> /opt/spark/jars/spark-core_2.12-3.1.1.jar
> /usr/lib/python37.zip
> /usr/lib/python3.7
> /usr/lib/python3.7/lib-dynload
> /usr/local/lib/python3.7/dist-packages
> /usr/lib/python3/dist-packages
>
> Printing user_paths
> /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/DSBQ.zip
> /opt/spark/python/lib/pyspark.zip
> /opt/spark/python/lib/py4j-0.10.9-src.zip
> /opt/spark/jars/spark-core_2.12-3.1.1.jar
>
> python version
> 3.7.3 (default, Jan 22 2021, 20:04:44)
> [GCC 8.3.0]
>
> looping over pkg_resources.working_set
> setuptools 57.2.0
> pip 21.1.3
> wheel 0.32.3
> six 1.12.0
> SecretStorage 2.3.1
> pyxdg 0.25
> PyGObject 3.30.4
> pycrypto 2.6.1
> keyrings.alt 3.1.1
> keyring 17.1.1
> entrypoints 0.3
> cryptography 2.6.1
> asn1crypto 0.24.0
> Traceback (most recent call last):
>   File "/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/testpackages.py", line 24, in <module>
>     main()
>   File "/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/testpackages.py", line 21, in main
>     import pandas
> ModuleNotFoundError: No module named 'pandas'
>
> I should add that if I go inside the Docker container and run
>
> 185@4a6747d59ff2:/opt/spark/work-dir$ pip3 list
> Package       Version
> ------------- -------
> asn1crypto    0.24.0
> cryptography  2.6.1
> entrypoints   0.3
> keyring       17.1.1
> keyrings.alt  3.1.1
> pip           21.1.3
> pycrypto      2.6.1
> PyGObject     3.30.4
> pyxdg         0.25
> SecretStorage 2.3.1
> setuptools    57.2.0
> six           1.12.0
> wheel         0.32.3
>
> I don't get any external packages!
>
> I opened an SO thread for this as well.
>
> https://stackoverflow.com/questions/68461865/unpacking-and-using-external-modules-with-pyspark-inside-kubernetes
>
> Do I need to hack the Dockerfile to install the requirement.txt etc?
>
> Thanks
>
> ---------- Forwarded message ---------
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Tue, 20 Jul 2021 at 22:51
> Subject: Unpacking and using external modules with PySpark inside k8s
> To: user @spark <u...@spark.apache.org>
>
> I have been struggling with this.
>
> Kubernetes (not that it matters, minikube) is working fine.
> In one of the modules, configure.py, I am importing the yaml module
>
>   import yaml
>
> This is throwing errors
>
>     import yaml
> ModuleNotFoundError: No module named 'yaml'
>
> I have been through a number of loops.
>
> First I created a virtual environment archive, pyspark_venv.tar.gz, that
> includes the yaml module and passed it to spark-submit as follows
>
> + spark-submit --verbose --master k8s://192.168.49.2:8443
>   '--archives=hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv'
>   --deploy-mode cluster --name pytest
>   --conf 'spark.kubernetes.namespace=spark'
>   --conf 'spark.executor.instances=1'
>   --conf 'spark.kubernetes.driver.limit.cores=1'
>   --conf 'spark.executor.cores=1'
>   --conf 'spark.executor.memory=500m'
>   --conf 'spark.kubernetes.container.image=pytest-repo/spark-py:3.1.1'
>   --conf 'spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount'
>   --py-files hdfs://50.140.197.220:9000/minikube/codes/DSBQ.zip
>   hdfs://50.140.197.220:9000/minikube/codes/testyml.py
>
> Parsed arguments:
>   master                  k8s://192.168.49.2:8443
>   deployMode              cluster
>   executorMemory          500m
>   executorCores           1
>   totalExecutorCores      null
>   propertiesFile          /opt/spark/conf/spark-defaults.conf
>   driverMemory            null
>   driverCores             null
>   driverExtraClassPath    $SPARK_HOME/jars/*.jar
>   driverExtraLibraryPath  null
>   driverExtraJavaOptions  null
>   supervise               false
>   queue                   null
>   numExecutors            1
>   files                   null
>   pyFiles                 hdfs://50.140.197.220:9000/minikube/codes/DSBQ.zip
>   archives                hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv
>   mainClass               null
>   primaryResource         hdfs://50.140.197.220:9000/minikube/codes/testyml.py
>   name                    pytest
>   childArgs               []
>   jars                    null
>   packages                null
>   packagesExclusions      null
>   repositories            null
>   verbose                 true
>
> Unpacking an archive
> hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv from
> /tmp/spark-d339a76e-090c-4670-89aa-da723d6e9fbc/pyspark_venv.tar.gz to
> /opt/spark/work-dir/./pyspark_venv
>
> printing sys.path
> /tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc
> /tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/DSBQ.zip
> /opt/spark/python/lib/pyspark.zip
> /opt/spark/python/lib/py4j-0.10.9-src.zip
> /opt/spark/jars/spark-core_2.12-3.1.1.jar
> /usr/lib/python37.zip
> /usr/lib/python3.7
> /usr/lib/python3.7/lib-dynload
> /usr/local/lib/python3.7/dist-packages
> /usr/lib/python3/dist-packages
>
> Printing user_paths
> ['/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/DSBQ.zip',
> '/opt/spark/python/lib/pyspark.zip',
> '/opt/spark/python/lib/py4j-0.10.9-src.zip',
> '/opt/spark/jars/spark-core_2.12-3.1.1.jar']
> checking yaml
> Traceback (most recent call last):
>   File "/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/testyml.py", line 18, in <module>
>     main()
>   File "/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/testyml.py", line 15, in main
>     import yaml
> ModuleNotFoundError: No module named 'yaml'
>
> Well, it does not matter whether it is yaml or numpy. It just cannot find
> the modules. How can I find out if the gz file is unpacked OK?
>
> Thanks
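
On the question of how to tell whether the gz file was unpacked OK: the logs in this thread show Spark unpacking the archive into /opt/spark/work-dir, while sys.path lists only the system Python 3.7 directories. The following is a minimal diagnostic sketch that could be run from the driver script, not a fix. It assumes the archive was unpacked under the current working directory with the alias pyspark_venv and that it has the usual venv layout (lib/pythonX.Y/site-packages). The helper names find_site_packages and describe_env are hypothetical and are not part of Spark.

```python
import glob
import os
import sys


def find_site_packages(env_dir):
    """Return site-packages directories under env_dir (assumed venv layout)."""
    pattern = os.path.join(env_dir, "lib", "python*", "site-packages")
    return sorted(glob.glob(pattern))


def describe_env(env_dir):
    """Collect basic facts about env_dir without importing anything from it."""
    site_dirs = find_site_packages(env_dir)
    # Absolute forms of the entries the running interpreter actually searches.
    active = {os.path.abspath(p) for p in sys.path if p}
    return {
        "path": os.path.abspath(env_dir),
        "is_dir": os.path.isdir(env_dir),
        "site_packages": site_dirs,
        "on_sys_path": [d for d in site_dirs if os.path.abspath(d) in active],
        "interpreter": sys.executable,
    }


if __name__ == "__main__":
    # "pyspark_venv" is the alias after '#' in --archives (an assumption here);
    # pass a different directory name as the first argument if yours differs.
    env_dir = sys.argv[1] if len(sys.argv) > 1 else "pyspark_venv"
    for key, value in describe_env(env_dir).items():
        print(f"{key}: {value}")
```

If is_dir is True and site_packages is non-empty but on_sys_path is empty, the archive was unpacked and the running interpreter is simply not looking at it. The Spark documentation on Python package management describes pointing PYSPARK_PYTHON at the interpreter inside the unpacked directory; whether that works with this particular image is not something the logs here confirm.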