We are currently using Dataproc on GCP to run our Spark workloads, and I'm planning to move this workload to Kubernetes (GKE).
Here is what has been done so far. Installed Spark using the Bitnami Helm chart:

```
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install spark -f sparkConfig.yaml bitnami/spark -n spark
```

Also deployed a LoadBalancer service, using this yaml:

```
apiVersion: v1
kind: Service
metadata:
  name: spark-master-lb
  labels:
    app: spark
    component: LoadBalancer
spec:
  selector:
    app.kubernetes.io/component: master
    app.kubernetes.io/instance: spark
    app.kubernetes.io/name: spark
  ports:
    - name: webui
      port: 8080
      targetPort: 8080
    - name: master
      port: 7077
      targetPort: 7077
  type: LoadBalancer
```

Spark is installed and the pods have come up. When I try to do a spark-submit in cluster mode, it gives the following error:

```
(base) Karans-MacBook-Pro:fromEdward-jan26 karanalang$ $SPARK_HOME/bin/spark-submit --master spark://<EXTERNAL_IP>:7077 --deploy-mode cluster --name spark-on-gke local:///Users/karanalang/Documents/Technology/0.spark-on-gke/StructuredStream-on-gke.py
24/08/26 12:03:26 WARN Utils: Your hostname, Karans-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 10.42.28.138 instead (on interface en0)
24/08/26 12:03:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/karanalang/Documents/Technology/spark-3.1.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.3.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Exception in thread "main" org.apache.spark.SparkException: Cluster deploy mode is currently not supported for python applications on standalone clusters.
	at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:968)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:273)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

In client mode, it gives the following error:

```
24/08/26 12:06:58 ERROR SparkContext: Error initializing SparkContext.
java.lang.NullPointerException
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:640)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:829)
24/08/26 12:06:58 INFO SparkContext: SparkContext already stopped.
```

A couple of questions:

1. Is using the Helm chart the correct way to install Apache Spark on GKE/k8s? (Note: I need to install on both GKE and on-prem Kubernetes.)
2. How do I submit PySpark jobs to a Spark cluster deployed on GKE? For example, do I need to create a K8s Deployment for each Spark job? (A sketch of the Kubernetes-native submit I'm considering as an alternative is at the end of this post.)

tia !

Here is the Stack Overflow link: https://stackoverflow.com/questions/78915988/unable-to-deploy-pyspark-application-on-gke-spark-installed-using-bitnami-helm
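For context, the alternative to the standalone chart that I'm considering is Spark's built-in Kubernetes scheduler, where spark-submit talks to the GKE API server directly and it spins up driver/executor pods per job. Below is a rough sketch of what I understand that submit would look like; the API server address, image repository, namespace, service account, and the in-image path to the .py file are placeholders I have not actually set up:

```
# Sketch only: <GKE_API_SERVER> and <REPO> are placeholders; the "spark" namespace,
# the "spark" service account, and the in-image path to the script are assumptions on my side.
$SPARK_HOME/bin/spark-submit \
  --master k8s://https://<GKE_API_SERVER>:443 \
  --deploy-mode cluster \
  --name spark-on-gke \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.container.image=<REPO>/spark-py:3.1.3 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.executor.instances=2 \
  local:///opt/spark/work-dir/StructuredStream-on-gke.py
```

My understanding is that with this approach each submit creates its own driver and executor pods (so no per-job Deployment object is needed), but I'd like to confirm whether this or the Bitnami standalone setup is the recommended way on GKE and on-prem Kubernetes.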