Hi Khalid and David. Thanks for your comments. I believe I have found the source of the high CPU utilisation on the host submitting spark-submit, which I referred to as the launch node.
This node was the master node of what is known as a Google Dataproc cluster. According to this link <https://cloud.google.com/dataproc>, "Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks. Use Dataproc for data lake modernization, ETL, and secure data science, at planet scale, fully integrated with Google Cloud, at a fraction of the cost." So it comes with Spark, Hadoop, Hive etc. pre-installed.

When I ran the old but useful *top* command with nothing of mine running, I noticed a lot of things running in the background. The largest consumer of CPU was the Spark history server, which runs as a daemon in the background checking for Spark jobs even though none was running at the time. So effectively, with these background processes, around 62% of CPU was being wasted.

So I went and built a VM compute server on GCP with 2 vCPUs and 8 GB of RAM. It comes with only Debian Buster installed. I downloaded Spark 3.1.1-hadoop3.2, Java 8, kubectl, docker and anything else needed, and created the environment for the Kubernetes run.

The Python job basically creates 10,000 rows of random data in Spark and writes them to a Google BigQuery (GBQ) table. The Docker image I built uses Spark 3.1.1 and Java 8, as Java 11 is not compatible with writing to BigQuery tables through the Spark-BigQuery connector.
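For reference, a minimal sketch of that kind of job, assuming the spark-bigquery connector is on the classpath; the project, dataset, table and temporary bucket names below are placeholders rather than the actual ones used:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import rand, randn

    spark = SparkSession.builder.appName("RandomDataBigQuery").getOrCreate()

    # 10,000 rows of random data: an id column plus uniform and normal random values
    df = (spark.range(0, 10000)
          .withColumn("random_uniform", rand(seed=42))
          .withColumn("random_normal", randn(seed=42)))

    # Write to BigQuery via the spark-bigquery connector; in this (indirect) mode the
    # connector stages the rows in a temporary GCS bucket before loading them into BigQuery
    (df.write.format("bigquery")
       .option("table", "my_project.my_dataset.random_data")   # placeholder table name
       .option("temporaryGcsBucket", "my-temp-bucket")         # placeholder bucket name
       .mode("overwrite")
       .save())

The actual code is packaged in the DSBQ.zip referenced by spark-submit; the above is only to illustrate the shape of the write path.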
Now everything looks much healthier, as the CPU utilisation diagram for the VM host (green, called ctpvm) and the GKE nodes (other colours) shows. The VM host uses a tiny 0.6% of CPU, followed by the GKE master node at 6.25% and the executors below 4.5%. Actually one executor (I asked for three) was not utilised at all, which makes sense as it is not needed:

k get pods -n spark
NAME                                         READY   STATUS    RESTARTS   AGE
pytest-a772f47b4677cf6e-driver               1/1     Running   0          20s
randomdatabigquery-acb66f7b46780d3b-exec-1   1/1     Running   0          2s
randomdatabigquery-acb66f7b46780d3b-exec-2   1/1     Running   0          2s
randomdatabigquery-acb66f7b46780d3b-exec-3   0/1     Pending   0          2s

[image: image.png]

Anyway, I think it all adds up now.

Cheers,

Mich

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


On Wed, 11 Aug 2021 at 13:17, David Diebold <davidjdieb...@gmail.com> wrote:

> Hi Mich,
>
> I don't quite understand why the driver node is using so much CPU, but it
> may be unrelated to your executors being underused.
> About your executors being underused, I would check that your job
> generated enough tasks.
> Then I would check the spark.executor.cores and spark.task.cpus parameters
> to see if I can give more work to the executors.
>
> Cheers,
> David
>
>
> On Tue, 10 Aug 2021 at 12:20, Khalid Mammadov <khalidmammad...@gmail.com>
> wrote:
>
>> Hi Mich
>>
>> I think you need to check your code.
>> If the code does not use the PySpark API effectively you may get this,
>> i.e. if you use the pure Python/pandas API rather than PySpark, i.e.
>> transform -> transform -> action, e.g. df.select(..).withColumn(...)...count()
>>
>> Hope this helps to put you in the right direction.
>>
>> Cheers
>> Khalid
>>
>>
>> On Mon, 9 Aug 2021, 20:20 Mich Talebzadeh, <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I have a basic question to ask.
>>>
>>> I am running a Google k8s cluster (aka GKE) with three nodes, each
>>> with the configuration below:
>>>
>>> e2-standard-2 (2 vCPUs, 8 GB memory)
>>>
>>> spark-submit is launched from another node (actually a Dataproc single
>>> node that I have just upgraded to e2-custom (4 vCPUs, 8 GB mem)). We
>>> call this the launch node.
>>>
>>> OK, I know that the cluster is not much, but Google was complaining about
>>> the launch node hitting 100% CPU, so I added two more CPUs to it.
>>>
>>> It appears that despite using k8s as the computational cluster, the
>>> burden falls upon the launch node!
>>>
>>> The CPU utilisation for the launch node is shown below:
>>>
>>> [image: image.png]
>>> The dip is when the 2 extra CPUs were added and it had to reboot, so
>>> around 70% usage.
>>>
>>> The combined CPU usage for the GKE nodes is shown below:
>>>
>>> [image: image.png]
>>>
>>> It never goes above 20%!
>>>
>>> I can see the driver and executors as below:
>>>
>>> k get pods -n spark
>>> NAME                                         READY   STATUS    RESTARTS   AGE
>>> pytest-c958c97b2c52b6ed-driver               1/1     Running   0          69s
>>> randomdatabigquery-e68a8a7b2c52f468-exec-1   1/1     Running   0          51s
>>> randomdatabigquery-e68a8a7b2c52f468-exec-2   1/1     Running   0          51s
>>> randomdatabigquery-e68a8a7b2c52f468-exec-3   0/1     Pending   0          51s
>>>
>>> It is a PySpark 3.1.1 image using Java 8, pushing randomly generated
>>> data into a Google BigQuery data warehouse table. The last executor (exec-3)
>>> seems to be just pending. The spark-submit is as below:
>>>
>>>         spark-submit --verbose \
>>>            --properties-file ${property_file} \
>>>            --master k8s://https://$KUBERNETES_MASTER_IP:443 \
>>>            --deploy-mode cluster \
>>>            --name pytest \
>>>            --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python \
>>>            --py-files $CODE_DIRECTORY/DSBQ.zip \
>>>            --conf spark.kubernetes.namespace=$NAMESPACE \
>>>            --conf spark.executor.memory=5000m \
>>>            --conf spark.network.timeout=300 \
>>>            --conf spark.executor.instances=3 \
>>>            --conf spark.kubernetes.driver.limit.cores=1 \
>>>            --conf spark.driver.cores=1 \
>>>            --conf spark.executor.cores=1 \
>>>            --conf spark.executor.memory=2000m \
>>>            --conf spark.kubernetes.driver.docker.image=${IMAGEGCP} \
>>>            --conf spark.kubernetes.executor.docker.image=${IMAGEGCP} \
>>>            --conf spark.kubernetes.container.image=${IMAGEGCP} \
>>>            --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
>>>            --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>>>            --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>>>            --conf spark.sql.execution.arrow.pyspark.enabled="true" \
>>>            $CODE_DIRECTORY/${APPLICATION}
>>>
>>> Aren't the driver and executors running on the k8s cluster? So why is the
>>> launch node heavily used while the k8s cluster is underutilised?
>>>
>>> Thanks
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.