I was curious to know whether there are any benchmarks around comparing
Spark on YARN with Spark on Kubernetes.


This question arose because traditionally in Google Cloud we have been
using Spark on Dataproc clusters. Dataproc
<https://cloud.google.com/dataproc> provides Spark, Hadoop and other
optional components for data and analytics processing. It is a PaaS
offering.


Google now offers GKE clusters as well and has introduced Apache Spark with
Cloud Dataproc on Kubernetes
<https://cloud.google.com/blog/products/data-analytics/modernize-apache-spark-with-cloud-dataproc-on-kubernetes>,
which allows one to submit Spark jobs to Kubernetes using Dataproc as the
submission platform, as below, from the Cloud Console or locally:


gcloud dataproc jobs submit pyspark gs://bucket/testme.py \
    --cluster="dataproc-for-gke" \
    --region="europe-west2" \
    --py-files gs://bucket/DSBQ.zip
Job [e5fc19b62cf744f0b13f3e6d9cc66c19] submitted.
Waiting for job output...
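
For comparison, this is only a rough sketch of what a native spark-submit
against a Kubernetes cluster might look like (the API server address,
container image and executor count are placeholders, and the image would
need the GCS connector on its classpath to read gs:// paths):

spark-submit \
    --master k8s://https://<k8s-api-server>:6443 \
    --deploy-mode cluster \
    --name testme \
    --conf spark.executor.instances=3 \
    --conf spark.kubernetes.container.image=<spark-py-image> \
    --py-files gs://bucket/DSBQ.zip \
    gs://bucket/testme.py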


At the moment it is hard to see what the merits of using Kubernetes instead
of Dataproc are, apart from notebooks etc. There is also not much
literature around on running PySpark on Kubernetes.


For me, Spark on bare metal is the preferred option, as I cannot see how
one can pigeonhole Spark into a container and keep it performant, but I may
be totally wrong.


Thanks

