I came across an article by Data Mechanics that benchmarks Spark on Kubernetes vs. YARN.
Link: https://www.datamechanics.co/blog-post/apache-spark-performance-benchmarks-show-kubernetes-has-caught-up-with-yarn

Regards,
Aditya

On Mon, Jul 5, 2021, 23:49 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Thanks Yuri. Those are very valid points.
>
> Let me clarify my point. Let us assume that we will be using YARN versus
> k8s for the same job: spark-submit will run the job on YARN first and
> will then switch to k8s for the same task.
>
> 1. Have there been such benchmarks?
> 2. When should I choose PaaS versus k8s, for example for small to
> medium size jobs?
> 3. I can see the flexibility of running Spark on Dataproc, but some
> may argue that k8s is the way forward.
> 4. Bear in mind that I am only considering Spark. For Kafka and
> ZooKeeper, for example, we opt for Docker containers as they do a
> single function.
>
> Cheers,
>
> Mich
>
> On Mon, 5 Jul 2021 at 19:06, "Yuri Oleynikov (יורי אולייניקוב)" <
> yur...@gmail.com> wrote:
>
>> Not a big expert on Spark, but I do not really understand how you are
>> going to compare, and what: reading and writing to and from HDFS? How is
>> that related to YARN and k8s? These are resource managers (YARN: Yet
>> Another Resource Negotiator) that decide what to allocate, how much, and
>> when (CPU, RAM). Local disk spilling? That depends on disk throughput.
>> So what are you going to measure?
>>
>> Best regards
>>
>> On 5 Jul 2021, at 20:43, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> I was curious to know if there are benchmarks around comparing Spark on
>> YARN with Spark on Kubernetes.
>>
>> This question arose because traditionally in Google Cloud we have been
>> using Spark on Dataproc clusters. Dataproc
>> <https://cloud.google.com/dataproc> provides Spark, Hadoop and others
>> (optional install) for data and analytics processing. It is PaaS.
>>
>> Now they have GKE clusters as well, and have also introduced Apache
>> Spark with Cloud Dataproc on Kubernetes
>> <https://cloud.google.com/blog/products/data-analytics/modernize-apache-spark-with-cloud-dataproc-on-kubernetes>,
>> which allows one to submit Spark jobs to k8s using Dataproc as the
>> platform, from the cloud console or locally, as below:
>>
>> gcloud dataproc jobs submit pyspark --cluster="dataproc-for-gke"
>> gs://bucket/testme.py --region="europe-west2" --py-files
>> gs://bucket/DSBQ.zip
>> Job [e5fc19b62cf744f0b13f3e6d9cc66c19] submitted.
>> Waiting for job output...
>>
>> At the moment it is a struggle to see what merits using k8s instead of
>> Dataproc, bar notebooks etc. There is also not much literature around on
>> PySpark on k8s.
>>
>> For me, Spark on bare metal is the preferred option, as I cannot see how
>> one can pigeonhole Spark into a container and make it performant, but I
>> may be totally wrong.
>>
>> Thanks
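
[Editor's note: for concreteness, a minimal sketch of the comparison Mich describes: the same PySpark job submitted once against YARN and once against Kubernetes, with matching executor resources (the CPU/RAM knobs Yuri refers to). The API server address, container image, and resource figures are placeholders, not values from the thread.]

# Same job on YARN (e.g. a Dataproc cluster); resource figures are illustrative
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  gs://bucket/testme.py

# Same job on Kubernetes; <k8s-apiserver> and <spark-py-image> are placeholders
spark-submit \
  --master k8s://https://<k8s-apiserver>:443 \
  --deploy-mode cluster \
  --name testme \
  --conf spark.executor.instances=4 \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=4g \
  --conf spark.kubernetes.container.image=<spark-py-image> \
  gs://bucket/testme.py

Holding the executor count, cores, and memory constant across the two submissions is what makes the benchmark about the scheduler rather than about resource allocation.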
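[Editor's note: on Mich's point 4 (Kafka and ZooKeeper in Docker containers), a minimal single-broker sketch for a local setup, assuming the Docker Hub zookeeper image and the Confluent Kafka image; the tags, ports, and listener settings are illustrative dev values, not production guidance.]

# Shared network so the broker can reach ZooKeeper by container name
docker network create kafka-net

# ZooKeeper, single node
docker run -d --name zookeeper --network kafka-net zookeeper:3.7

# Kafka broker pointing at that ZooKeeper; advertised on localhost for host clients
docker run -d --name kafka --network kafka-net -p 9092:9092 \
  -e KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 \
  -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092 \
  -e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
  confluentinc/cp-kafka:6.2.0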