Thanks Julien for the further info. I have been working for a few days in my free time on PySpark on Kubernetes, both on minikube and on Google Cloud Platform (GCP), which provides Spark on Google Kubernetes Engine (GKE). Frankly, my work on k8s has been a bit disappointing.

In GCP the only available and supported Docker image is built on spark-2.3.0-bin-hadoop2.7; I believe other cloud vendors provide something similar, but nothing in the 3.x.x range. See "Using Spark on Kubernetes Engine to Process Data in BigQuery" <https://cloud.google.com/architecture/spark-on-kubernetes-engine>. The sample project there is not working either.

With regard to shuffling, there is some literature about using local storage for shuffle in Kubernetes. For example, see here <https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage>.

For my minikube setup I built a Docker image on Spark 3.1.1 for PySpark, but I am having a number of configuration issues that have dampened my enthusiasm, and they raise the inevitable question: regardless of performance, can Kubernetes be used in anger today at industrial scale?
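For concreteness, the sort of configuration that local-storage documentation describes looks like the sketch below. This is untested; the container image, mount path and node-local SSD path are placeholders, and in practice these settings would normally be passed to spark-submit as --conf flags rather than set in code:

from pyspark.sql import SparkSession

# Mount a node-local disk as Spark scratch space so shuffle data spills
# to fast local storage instead of the container's writable layer.
# Volume names starting with spark-local-dir- are treated as local dirs.
spark = (
    SparkSession.builder
    .appName("shuffle-local-storage-test")
    .config("spark.kubernetes.container.image", "myrepo/spark-py:3.1.1")  # placeholder image
    .config("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path",
            "/tmp/spark-local")
    .config("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path",
            "/mnt/disks/ssd0")  # placeholder host path to a local SSD
    .getOrCreate()
)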
Regards,

Mich

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Fri, 23 Jul 2021 at 14:24, Julien Laurenceau <julien.laurenc...@pepitedata.com> wrote:

> Hi,
>
> Good question!
> It is very dependent on your jobs and your developer team.
>
> The things that mostly differ, in my view, are:
>
> 1/ Data locality & fast reads
> If your data are stored in an HDFS cluster (not HCFS) and your Spark compute nodes are allowed to run on the Hadoop nodes, then definitely use Yarn to benefit from fast reads and better data locality.
>
> 2/ Shuffle
> Last time I checked, there was an issue because Spark sees each of its K8s containers with a distinct IP, so it does not know which containers are on the same host. It therefore cannot optimize shuffle to avoid network traffic between hosts. If your Spark tasks do not use the full size of the host, shuffle will not work at its best, and the locality levels NODE_LOCAL and RACK_LOCAL will be useless.
>
> Lastly, I tend to think that if data locality & fast reads are not an issue and shuffle is not an issue for your solution, then optimizing data access does not matter and you are probably running mostly CPU-intensive jobs. In that case you could consider alternatives to Spark, especially if you were considering PySpark (Spark runs on the JVM). Have a look, for example, at python-dask.
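> To make that concrete, here is a minimal python-dask sketch of a typical aggregation (the bucket path and column names are made up for illustration):
>
> import dask.dataframe as dd
>
> # Read a partitioned dataset (gs:// paths need the gcsfs package)
> # and run a simple aggregation on a dask cluster.
> df = dd.read_parquet("gs://some-bucket/events/")
> result = df.groupby("customer_id")["amount"].mean().compute()
> print(result.head())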
> Regards,
> Julien

> Le mar. 6 juil. 2021 à 11:23, Mich Talebzadeh <mich.talebza...@gmail.com> a écrit :
>
>> I had a chance to look at this paper.
>>
>> I have reservations about this benchmark. They used Google Dataproc, with which you can create a cluster with Hadoop and Spark (they used Spark 3) and decide on the number of worker nodes.
>>
>> This is the layout of their setup:
>>
>> The benchmark compares Spark running on Data Mechanics <https://www.datamechanics.co/> (deployed on Google Kubernetes Engine <https://cloud.google.com/kubernetes-engine>) and Spark running on Dataproc <https://cloud.google.com/dataproc> (GCP's managed Hadoop offering).
>>
>> Driver: n2-standard-4 instance
>>
>> - 4 vCPUs
>> - 16GB RAM
>>
>> 5 executors on n2-highmem-4 instances
>>
>> - 4 vCPUs
>> - 32GB RAM
>> - 375GB local SSD
>>
>> It is true that you can attach SSDs to Dataproc nodes and to Kubernetes cluster nodes respectively. These would be local SSDs. The paper states that 10TB of test data was used, so somehow they decided to distribute 10TB of data across these local SSDs.
>>
>> Real life does not work like this, I am afraid. If you are going to run Spark on data somewhere, the likelihood is that the data is stored on HDFS, cloud storage buckets (HCFS) or some tables in cloud databases. So why bring in SSDs here? From the Google link here <https://cloud.google.com/compute/docs/disks/local-ssd>, and I quote: "Local SSDs are suitable only for temporary storage such as caches, processing space, or low value data. To store data that is not temporary or ephemeral in nature, use one of our durable storage options <https://cloud.google.com/compute/docs/disks>." With 10TB of data, you might as well store it on cloud storage and run your Spark tests against that data with both the Dataproc cluster and the Kubernetes cluster.
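>> Pointing both clusters at the same durable copy of the data is trivial; a sketch, with a hypothetical bucket and column name:
>>
>> from pyspark.sql import SparkSession
>>
>> spark = SparkSession.builder.appName("benchmark-read").getOrCreate()
>>
>> # The same gs:// path is readable from Dataproc (the GCS connector is
>> # preinstalled there) and from Spark on GKE, so both sides of the
>> # benchmark would read identical input.
>> df = spark.read.parquet("gs://benchmark-bucket/testdata-10tb/")
>> df.groupBy("some_column").count().show()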
>> IMO this is not a qualified benchmark, I am afraid.
>>
>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>
>> On Mon, 5 Jul 2021 at 20:27, Madaditya .Maddy <w47snea...@gmail.com> wrote:
>>
>>> I came across an article by Data Mechanics that benchmarked Spark on k8s vs Yarn.
>>>
>>> Link: https://www.datamechanics.co/blog-post/apache-spark-performance-benchmarks-show-kubernetes-has-caught-up-with-yarn
>>>
>>> -Regards
>>> Aditya
>>>
>>> On Mon, Jul 5, 2021, 23:49 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Thanks Yuri. Those are very valid points.
>>>>
>>>> Let me clarify my point. Let us assume that we will be using Yarn versus k8s to do the same job: spark-submit will use Yarn in the first instance and will then switch to using k8s for the same task.
>>>>
>>>> 1. Have there been such benchmarks?
>>>> 2. When should I choose PaaS versus k8s, for example for small to medium-sized jobs?
>>>> 3. I can see the flexibility of running Spark on Dataproc, yet some may argue that k8s is the way forward.
>>>> 4. Bear in mind that I am only considering Spark. For Kafka and ZooKeeper, for example, we opt for Docker containers as they perform a single function.
>>>>
>>>> Cheers,
>>>>
>>>> Mich
>>>>
>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>
>>>> On Mon, 5 Jul 2021 at 19:06, "Yuri Oleynikov (יורי אולייניקוב)" <yur...@gmail.com> wrote:
>>>>
>>>>> Not a big expert on Spark, but I do not really understand how you are going to compare, and what? Reading and writing to and from HDFS? How is that related to Yarn and k8s? These are resource managers (YARN: Yet Another Resource Negotiator); they decide what to allocate, how much, and when (CPU, RAM). Local disk spilling? That depends on disk throughput…
>>>>> So what are you going to measure?
>>>>>
>>>>> Best regards
>>>>>
>>>>> On 5 Jul 2021, at 20:43, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> I was curious to know if there are any benchmarks around comparing Spark on Yarn with Spark on Kubernetes.
>>>>>
>>>>> This question arose because traditionally in Google Cloud we have been using Spark on Dataproc clusters. Dataproc <https://cloud.google.com/dataproc> provides Spark, Hadoop and others (optional installs) for data and analytics processing. It is PaaS.
>>>>>
>>>>> Now they have GKE clusters as well, and have also introduced Apache Spark with Cloud Dataproc on Kubernetes <https://cloud.google.com/blog/products/data-analytics/modernize-apache-spark-with-cloud-dataproc-on-kubernetes>, which allows one to submit Spark jobs to k8s using Dataproc as a stub platform, as below, from the cloud console or locally:
>>>>>
>>>>> gcloud dataproc jobs submit pyspark --cluster="dataproc-for-gke" gs://bucket/testme.py --region="europe-west2" --py-files gs://bucket/DSBQ.zip
>>>>> Job [e5fc19b62cf744f0b13f3e6d9cc66c19] submitted.
>>>>> Waiting for job output...
>>>>>
>>>>> At the moment it is a struggle to see what merit there is in using k8s instead of Dataproc, bar notebooks etc. There is actually not much literature around on PySpark on k8s.
>>>>>
>>>>> For me, Spark on bare metal is the preferred option, as I cannot see how one can pigeonhole Spark into a container and make it performant, but I may be totally wrong.
>>>>>
>>>>> Thanks
>>>>>
>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
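PS. For anyone wanting to try a comparable submission, a minimal PySpark smoke test in the spirit of the testme.py referenced above (purely illustrative, not the actual file) would be:

from pyspark.sql import SparkSession

# Trivial job: enough to confirm that executors come up and a shuffle completes.
spark = SparkSession.builder.appName("testme").getOrCreate()
df = spark.range(0, 10_000_000)
# groupBy forces a shuffle across executors.
print(df.selectExpr("id % 100 AS k").groupBy("k").count().count())
spark.stop()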