Hi,
Using Minikube to run a containerised Spark cluster, I can easily use spark-submit with an uber jar file, as shown below:
bin/spark-submit \
--master k8s://$KSERVER \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.driver.pod.name=spark-pi-driver \
--conf spark.kubernetes.container.image=spark:latest \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount \
local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
This works fine and comes back with "Pi is roughly 3.1380356901784507".
For Scala this submission is easy, because when you write Spark code in Scala you can bundle your dependencies into the jar file that you submit to Spark, namely
local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
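For completeness, an uber jar like that can be built roughly as follows, assuming an sbt project with the sbt-assembly plugin enabled (other build tools work too):

# Minimal sketch, assuming sbt-assembly is declared in project/plugins.sbt.
# "assembly" bundles the compiled classes plus all library dependencies into one fat jar
# under target/scala-2.12/, which can then be referenced via local:// in spark-submit.
sbt clean assembly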
However, when writing Spark code in Python, dependency management becomes more difficult, because each of the Spark executor nodes performing computations needs to have all of the Python dependencies installed.
This is normally resolved by creating a dependencies.zip file from site-packages under the Python virtual environment:
/usr/src/Python-3.7.3/airflow_virtualenv/lib/python3.7/site-packages
zip -r ../dependencies.zip .
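For context, the full sequence is roughly as follows (the pip-installed packages shown here are placeholders only):

# Minimal sketch, assuming the virtualenv already exists at this path;
# the packages installed here are illustrative placeholders only.
source /usr/src/Python-3.7.3/airflow_virtualenv/bin/activate
pip install numpy pandas          # replace with your actual dependencies
cd /usr/src/Python-3.7.3/airflow_virtualenv/lib/python3.7/site-packages
zip -r ../dependencies.zip .      # yields .../lib/python3.7/dependencies.zip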
Then I can use that zip file on-prem:
spark-submit --master local[4] \
--py-files local:///usr/src/Python-3.7.3/airflow_virtualenv/lib/python3.7/dependencies.zip \
local:///opt/spark/examples/src/main/python/pi.py
However, how does one use that dependency zip in Minikube?
This is the submission code:
bin/spark-submit --verbose \
--master k8s://$KSERVER \
--deploy-mode client \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.driver.pod.name=spark-pi-driver \
--conf spark.kubernetes.container.image=spark:latest \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount \
--py-files=local:///usr/src/Python-3.7.3/airflow_virtualenv/lib/python3.7/dependencies.zip \
local:///opt/spark/examples/src/main/python/pi.py
It throws this error:
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'File file:/d4T/hduser/spark-3.1.1-bin-hadoop3.2/ does not exist'. Please specify one with --class.
        at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:959)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:486)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
./pyspark-minikube.sh[55]: --name: not found [No such file or directory]
How can one resolve this issue?
Thanks