Re: Performance of PySpark jobs on the Kubernetes cluster

2021-08-14 Thread Mich Talebzadeh
Hi Khalid and David. Thanks for your comments. I believe I found the source of the high CPU utilisation on the host submitting spark-submit, which I referred to as the launch node. This node was the master node of what is known as a Google Dataproc cluster. According to this link …
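[For anyone hitting the same symptom, a minimal sketch (app name arbitrary) to confirm which host is running the driver; with the default client deploy mode that is the machine where spark-submit was invoked, e.g. the Dataproc master:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("driver-host-check").getOrCreate()

    # spark.driver.host is set at runtime to the machine running the driver;
    # in client deploy mode that is the node where spark-submit ran.
    print(spark.conf.get("spark.driver.host"))
    print(spark.sparkContext.uiWebUrl)  # the Spark UI runs alongside the driver
]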

Re: Performance of PySpark jobs on the Kubernetes cluster

2021-08-11 Thread David Diebold
Hi Mich, I don't quite understand why the driver node is using so much CPU, but it may be unrelated to your executors being under-used. As for the executors being under-used, I would first check that your job generates enough tasks. Then I would check the spark.executor.cores and spark.task.cpus parameters t…
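[A minimal sketch of the checks David describes (the input path and config values are hypothetical): compare the number of tasks a stage will generate, i.e. the partition count, against the parallelism the executors can actually offer.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("parallelism-check")          # arbitrary name
             .config("spark.executor.cores", "4")   # cores per executor
             .config("spark.task.cpus", "1")        # cores reserved per task
             .getOrCreate())

    df = spark.read.parquet("gs://some-bucket/input")  # hypothetical input

    # Each stage runs roughly one task per partition; if this number is
    # smaller than total executor cores / spark.task.cpus, executors sit idle.
    print("partitions:", df.rdd.getNumPartitions())
]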

Re: Performance of PySpark jobs on the Kubernetes cluster

2021-08-10 Thread Khalid Mammadov
Hi Mich, I think you need to check your code. If the code does not use the PySpark API effectively you may see this, i.e. if you use the pure Python/pandas API rather than PySpark's transform -> transform -> action pattern, e.g. df.select(..).withColumn(...)...count(). Hope this helps to put you in the right direction. Cheers
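[To illustrate Khalid's point, a hedged sketch of the two styles (input path, column name, and threshold are made up): the pandas route materialises everything on the driver, while the PySpark chain stays lazy until the single action and runs on the executors.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("api-style").getOrCreate()
    df = spark.read.parquet("gs://some-bucket/input")  # hypothetical input

    # Anti-pattern: pull the whole dataset onto the driver, then use pandas.
    pdf = df.toPandas()
    n_pandas = int((pdf["amount"] > 100).sum())

    # PySpark pattern: transform -> transform -> action; work stays distributed.
    n_spark = (df.select("amount")
                 .withColumn("big", F.col("amount") > 100)
                 .where(F.col("big"))
                 .count())
]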