I came across an article by Data Mechanics that benchmarks Spark on
Kubernetes against Spark on YARN.

Link:
https://www.datamechanics.co/blog-post/apache-spark-performance-benchmarks-show-kubernetes-has-caught-up-with-yarn

-Regards
Aditya

On Mon, Jul 5, 2021, 23:49 Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks Yuri. Those are very valid points.
>
> Let me clarify my point. Let us assume that we run the same job on YARN
> and on k8s. spark-submit will use YARN in the first instance and will
> then switch to k8s for the same task.
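
A minimal sketch of such a like-for-like run, assuming plain spark-submit
can reach both clusters. The API server address, the container image and
the executor sizing are placeholders and not taken from this thread; the
application paths are borrowed from the gcloud example quoted further down.
The script only builds and prints the two commands. time_command is a
hypothetical helper that would measure end-to-end wall-clock time, which
includes submission and scheduling overhead as well as the job itself.

```python
import shlex
import subprocess
import time


def build_submit_cmd(master, app, py_files=None, conf=None):
    """Return a spark-submit argument list for the given cluster manager."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", "cluster"]
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    if py_files:
        cmd += ["--py-files", py_files]
    return cmd + [app]


def time_command(cmd, runner=subprocess.run):
    """Run cmd once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    runner(cmd, check=True)
    return time.perf_counter() - start


# Same application and same executor sizing for both runs, so that only the
# cluster manager differs. All values below are placeholders (assumptions).
APP = "gs://bucket/testme.py"
PY_FILES = "gs://bucket/DSBQ.zip"
SIZING = {
    "spark.executor.instances": "4",
    "spark.executor.cores": "2",
    "spark.executor.memory": "4g",
}

yarn_cmd = build_submit_cmd("yarn", APP, PY_FILES, SIZING)
k8s_cmd = build_submit_cmd(
    "k8s://https://k8s-apiserver.example.com:6443",  # hypothetical API server
    APP,
    PY_FILES,
    {**SIZING, "spark.kubernetes.container.image": "example/spark-py:latest"},
)

# Dry run: print the two commands instead of executing them.
print(shlex.join(yarn_cmd))
print(shlex.join(k8s_cmd))
```

To time a real run one would call time_command(yarn_cmd) and
time_command(k8s_cmd) several times each and compare the distributions
rather than single figures.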
>
>
>    1. Have there been such benchmarks?
>    2. When should I choose PaaS over k8s, for example for small to
>    medium size jobs?
>    3. I can see the flexibility of running Spark on Dataproc, but some
>    may argue that k8s is the way forward.
>    4. Bear in mind that I am only considering Spark. For Kafka and
>    Zookeeper, for example, we opt for Docker containers, as each does a
>    single function.
>
>
> Cheers,
>
> Mich
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 5 Jul 2021 at 19:06, "Yuri Oleynikov (יורי אולייניקוב)" <
> yur...@gmail.com> wrote:
>
>> I am not a big expert on Spark, but I do not really understand what you
>> are going to compare, and how. Reading from and writing to HDFS? How is
>> that related to YARN and k8s? These are resource managers (YARN stands
>> for Yet Another Resource Negotiator): they decide what to allocate, how
>> much, and when (CPU, RAM). Local disk spilling? That depends on disk
>> throughput. So what are you going to measure?
>>
>>
>>
>>
>> Best regards
>>
>> On 5 Jul 2021, at 20:43, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>
>> I was curious to know whether there are any benchmarks comparing Spark
>> on YARN with Spark on Kubernetes.
>>
>>
>> This question arose because traditionally in Google Cloud we have been
>> using Spark on Dataproc clusters. Dataproc
>> <https://cloud.google.com/dataproc> provides Spark and Hadoop, plus other
>> components as optional installs, for data and analytics processing. It is
>> a PaaS offering.
>>
>>
>> Now they have GKE clusters as well, and have also introduced Apache Spark
>> with Cloud Dataproc on Kubernetes
>> <https://cloud.google.com/blog/products/data-analytics/modernize-apache-spark-with-cloud-dataproc-on-kubernetes>,
>> which allows one to submit Spark jobs to k8s using a Dataproc stub as the
>> submission platform, from the Cloud Console or from a local machine, as
>> below:
>>
>>
>> gcloud dataproc jobs submit pyspark --cluster="dataproc-for-gke" \
>>   gs://bucket/testme.py --region="europe-west2" \
>>   --py-files gs://bucket/DSBQ.zip
>> Job [e5fc19b62cf744f0b13f3e6d9cc66c19] submitted.
>> Waiting for job output...
>>
>>
>> At the moment I struggle to see what merits k8s has over Dataproc, bar
>> notebooks etc. There is also not much literature around on PySpark on
>> k8s.
>>
>>
>> For me Spark on bare metal is the preferred option, as I cannot see how
>> one can pigeonhole Spark into a container and make it performant, but I
>> may be totally wrong.
>>
>>
>> Thanks
>>
>>
