Re: Spark on Kubernetes scheduler variety

Lalwani, Jayesh Thu, 24 Jun 2021 08:26:07 -0700

You can always chain aggregations by chaining multiple Structured Streaming 
jobs. It’s not a showstopper.

Getting Spark on Kubernetes is important for organizations that want to pursue 
a multi-cloud strategy.

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Wednesday, June 23, 2021 at 11:27 AM
To: "user @spark" <user@spark.apache.org>
Cc: dev <d...@spark.apache.org>
Subject: RE: [EXTERNAL] Spark on Kubernetes scheduler variety

Please allow me to offer a different point of view on this roadmap.

I believe that, from a technical point of view, spending time, effort, and 
talent on batch scheduling on Kubernetes could be rewarding. However, if I may 
say so, I doubt whether such an approach, and the so-called democratization of 
Spark on whatever platform, should really be a major focus.

Having worked on Google Dataproc<https://cloud.google.com/dataproc> (a fully 
managed and highly scalable service for running Apache Spark, Hadoop and, more 
recently, other artefacts) for the past two years, and on Spark on Kubernetes 
on-premises, I have come to the conclusion that Spark is not a beast that one 
can fully commoditize in the way one can ZooKeeper, Kafka, etc. There is 
always a struggle to make some niche areas of Spark, like Spark Structured 
Streaming (SSS), work seamlessly and effortlessly on these commercial 
platforms with whatever-as-a-Service.

Moreover, Spark (and I stand to be corrected) already has a lot of resiliency 
and redundancy built in from the ground up. It is truly an enterprise-class 
product (one that requires enterprise-class support) that will be difficult to 
commoditize with Kubernetes while expecting the same performance. After all, 
Kubernetes is aimed at efficient resource sharing and potential cost savings 
for the mass market. In short, I can see that commercial enterprises will work 
on these platforms, but maybe the great talents on the dev team should focus 
on things like the perceived limitation of SSS in dealing with chained 
aggregations (if I am correct, these are not yet supported on streaming 
datasets).

These are my opinions and they are not facts, just opinions so to speak :)

View my LinkedIn 
profile<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.

On Fri, 18 Jun 2021 at 23:18, Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
I think these approaches are good, but there are limitations (e.g., dynamic 
scaling) unless we make changes inside the Spark Kube scheduler.

Certainly, whichever scheduler extensions we add support for, we should 
collaborate with the people developing those extensions insofar as they are 
interested. The first place I checked was #sig-scheduling, which is fairly 
quiet on the Kubernetes Slack, but if there are more places to look for folks 
interested in batch scheduling on Kubernetes we should definitely give it a 
shot :)

On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh 
<mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>> wrote:
Hi,

Regarding your point, and I quote:

"..  I know that one of the Spark on Kube operators supports volcano/kube-batch 
so I was thinking that might be a place I would start exploring..."

There seems to be ongoing work on, say, Volcano as part of the Cloud Native 
Computing Foundation<https://cncf.io/> (CNCF), for example through 
https://github.com/volcano-sh/volcano

There may be value-add in collaborating with such groups through CNCF in order 
to have a collective approach to such work. There also seems to be some work on 
Integration of Spark with Volcano for Batch 
Scheduling.<https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md>

What is not very clear is the degree of progress of these projects. Perhaps 
you could elaborate on the KPIs for each of these projects and on where you 
think your contributions are going to be.

HTH,

Mich

On Fri, 18 Jun 2021 at 00:44, Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
Hi Folks,

I'm continuing my adventures to make Spark on containers party, and I
was wondering whether folks have experience with the different batch
scheduler options and which ones they prefer. I was thinking that, to
better support dynamic allocation, it might make sense for us to
support using different schedulers, and I wanted to see if there are
any that the community is more interested in.

I know that one of the Spark on Kube operators supports
volcano/kube-batch so I was thinking that might be a place I start
exploring but also want to be open to other schedulers that folks
might be interested in.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau