+1 for the SPIP. In our production experience, the default scheduler doesn't meet production requirements on K8s, and integrating with batch-native schedulers makes running Spark natively on K8s much easier for users.
Thanks,
Bowen

On Wed, Jan 5, 2022 at 11:52 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

+1 non-binding

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Wed, 5 Jan 2022 at 19:16, Holden Karau <hol...@pigscanfly.ca> wrote:

Do we want to move the SPIP forward to a vote? It seems like we're mostly agreeing in principle?

On Wed, Jan 5, 2022 at 11:12 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi Bo,

Thanks for the info. Let me elaborate:

In theory you can set the number of executors to a multiple of the number of nodes. For example, if you have a three-node K8s cluster (in my case Google GKE), you can set the number of executors to 6 and end up with six executors queuing to start, but you ultimately finish with two running executors plus the driver in the 3-node cluster, as shown below:

hduser@ctpvm: /home/hduser> k get pods -n spark
NAME                                         READY   STATUS    RESTARTS   AGE
randomdatabigquery-d42d067e2b91c88a-exec-1   1/1     Running   0          33s
randomdatabigquery-d42d067e2b91c88a-exec-2   1/1     Running   0          33s
randomdatabigquery-d42d067e2b91c88a-exec-3   0/1     Pending   0          33s
randomdatabigquery-d42d067e2b91c88a-exec-4   0/1     Pending   0          33s
randomdatabigquery-d42d067e2b91c88a-exec-5   0/1     Pending   0          33s
randomdatabigquery-d42d067e2b91c88a-exec-6   0/1     Pending   0          33s
sparkbq-0beda77e2b919e01-driver              1/1     Running   0          45s

hduser@ctpvm: /home/hduser> k get pods -n spark
NAME                                         READY   STATUS    RESTARTS   AGE
randomdatabigquery-d42d067e2b91c88a-exec-1   1/1     Running   0          38s
randomdatabigquery-d42d067e2b91c88a-exec-2   1/1     Running   0          38s
sparkbq-0beda77e2b919e01-driver              1/1     Running   0          50s

hduser@ctpvm: /home/hduser> k get pods -n spark
NAME                                         READY   STATUS    RESTARTS   AGE
randomdatabigquery-d42d067e2b91c88a-exec-1   1/1     Running   0          40s
randomdatabigquery-d42d067e2b91c88a-exec-2   1/1     Running   0          40s
sparkbq-0beda77e2b919e01-driver              1/1     Running   0          52s

So you end up with the four pending executors dropping out. Hence the conclusion seems to be that, with the current model, you want to fit exactly one Spark executor pod per Kubernetes node.

HTH

On Wed, 5 Jan 2022 at 17:01, bo yang <bobyan...@gmail.com> wrote:

Hi Mich,

Curious what you mean by "The constraint seems to be that you can fit one Spark executor pod per Kubernetes node and from my tests you don't seem to be able to allocate more than 50% of RAM on the node to the container". Would you help to explain a bit? Asking this because there could be multiple executor pods running on a single Kubernetes node.

Thanks,
Bo

On Wed, Jan 5, 2022 at 1:13 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Thanks William for the info.
The current model of Spark on K8s has certain drawbacks with pod-based scheduling, as I found when testing it on Google Kubernetes Engine (GKE). The constraint seems to be that you can fit one Spark executor pod per Kubernetes node, and from my tests you don't seem to be able to allocate more than 50% of the RAM on the node to the container.

[image: gke_memoeyPlot.png]

Anything more results in the container never being created (stuck at Pending):

kubectl describe pod sparkbq-b506ac7dc521b667-driver -n spark

Events:
  Type     Reason             Age                   From                Message
  ----     ------             ----                  ----                -------
  Warning  FailedScheduling   17m                   default-scheduler   0/3 nodes are available: 3 Insufficient memory.
  Warning  FailedScheduling   17m                   default-scheduler   0/3 nodes are available: 3 Insufficient memory.
  Normal   NotTriggerScaleUp  2m28s (x92 over 17m)  cluster-autoscaler  pod didn't trigger scale-up:

Obviously this is far from ideal; this model, although it works, is not efficient.

Cheers,

Mich

On Wed, 5 Jan 2022 at 03:55, William Wang <wang.platf...@gmail.com> wrote:

Hi Mich,

Here are some of the performance indications for Volcano:
1. Scheduler throughput: 1.5k Pods/s (default scheduler: 100 Pods/s).
2. Spark application performance improved by 30%+ with the minimum-resource reservation feature in the case of insufficient resources (tested with TPC-DS).

We are still working on more optimizations. Besides performance, Volcano is being continuously enhanced in the four directions below to provide the abilities users care about:
- Full lifecycle management for jobs
- Scheduling policies for high-performance workloads (fair-share, topology, SLA, reservation, preemption, backfill, etc.)
- Support for heterogeneous hardware
- Performance optimization for high-performance workloads

Thanks,
LeiBo

On Tue, Jan 4, 2022 at 18:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Interesting, thanks.

Do you have any indication of a ballpark figure (a rough numerical estimate) of how much adding Volcano as an alternative scheduler is going to improve Spark on K8s performance?

Thanks

On Tue, 4 Jan 2022 at 09:43, Yikun Jiang <yikunk...@gmail.com> wrote:

Hi, folks! Wishing you all the best in 2022.
I'd like to share the current status of "Support Customized K8S Scheduler in Spark".

https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg/edit#heading=h.1quyr1r2kr5n

Framework/common support:

- The Volcano and Yunikorn teams have joined the discussion and completed the initial doc on the framework/common part.

- SPARK-37145 <https://issues.apache.org/jira/browse/SPARK-37145> (under review): We proposed to extend the customized scheduler by just using a custom feature step; it will meet the requirements of customized schedulers after it gets merged. After this, the user can enable the feature step and scheduler like:

spark-submit \
  --conf spark.kubernetes.scheduler.name=volcano \
  --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.scheduler.VolcanoFeatureStep \
  --conf spark.kubernetes.job.queue=xxx

(As above, the VolcanoFeatureStep will help to set the Spark scheduler queue according to the user-specified conf.)

- SPARK-37331 <https://issues.apache.org/jira/browse/SPARK-37331>: Added the ability to create Kubernetes resources before driver pod creation.

- SPARK-36059 <https://issues.apache.org/jira/browse/SPARK-36059>: Add the ability to specify a scheduler in the driver/executor.

After all of the above, the framework/common support will be ready for most customized schedulers.

Volcano part:

- SPARK-37258 <https://issues.apache.org/jira/browse/SPARK-37258>: Upgrade kubernetes-client to 5.11.1 to add Volcano scheduler API support.
- SPARK-36061 <https://issues.apache.org/jira/browse/SPARK-36061>: Add a VolcanoFeatureStep to help users create a PodGroup with the user-specified minimum resources required; there is also a WIP commit to show a preview of this <https://github.com/Yikun/spark/pull/45/commits/81bf6f98edb5c00ebd0662dc172bc73f980b6a34>.

Yunikorn part:

- @WeiweiYang is completing the doc for the Yunikorn part and implementing it.

Regards,
Yikun

On Thu, Dec 2, 2021 at 02:00, Weiwei Yang <w...@apache.org> wrote:

Thank you Yikun for the info, and thanks for inviting me to a meeting to discuss this.
I appreciate your effort to put these together, and I agree that the purpose is to make Spark easy/flexible enough to support other K8s schedulers (not just Volcano).
As discussed, could you please help to abstract out the things in common and allow Spark to plug in different implementations? I'd be happy to work with you guys on this issue.

On Tue, Nov 30, 2021 at 6:49 PM Yikun Jiang <yikunk...@gmail.com> wrote:

@Weiwei @Chenya

> Thanks for bringing this up. This is quite interesting, we definitely should participate more in the discussions.

Thanks for your reply, and welcome to the discussion; I think the input from Yunikorn is very critical.

> The main thing here is, the Spark community should make Spark pluggable in order to support other schedulers, not just for Volcano. It looks like this proposal is pushing really hard for adopting PodGroup, which isn't part of K8s yet; that to me is problematic.
Definitely yes, we are on the same page.

I think we have the same goal: propose a general and reasonable mechanism to make Spark on K8s with a custom scheduler more usable.

But for the PodGroup, allow me to give a brief introduction:
- The PodGroup definition has been approved by Kubernetes officially in KEP-583. [1]
- It can be regarded as a general concept/standard in Kubernetes rather than a concept specific to Volcano; there are also other implementations of it, such as [2][3].
- Kubernetes recommends using CRDs to build further extensions that implement what is wanted. [4]
- Volcano, as an extension, provides an interface to maintain the lifecycle of the PodGroup CRD and uses volcano-scheduler to complete the scheduling.

[1] https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/583-coscheduling
[2] https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/coscheduling#podgroup
[3] https://github.com/kubernetes-sigs/kube-batch
[4] https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/

Regards,
Yikun

On Wed, Dec 1, 2021 at 5:57 AM, Weiwei Yang <w...@apache.org> wrote:

Hi Chenya

Thanks for bringing this up. This is quite interesting; we definitely should participate more in the discussions.
The main thing here is, the Spark community should make Spark pluggable in order to support other schedulers, not just for Volcano. It looks like this proposal is pushing really hard for adopting PodGroup, which isn't part of K8s yet; that to me is problematic.
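For reference, the PodGroup discussed above is declared as a small CRD manifest. A minimal sketch, assuming Volcano's scheduling.volcano.sh/v1beta1 API (the name, namespace, and resource values here are illustrative, not from this thread):

```yaml
# Minimal PodGroup sketch (illustrative values): gang-schedule a Spark job
# only when the driver plus two executors can be placed together.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-job-podgroup
  namespace: spark
spec:
  minMember: 3           # driver + 2 executors must be schedulable as a unit
  minResources:
    cpu: "6"
    memory: "12Gi"
  queue: default         # Volcano queue the group is submitted to
```

Pods then join the group via a group-name annotation and by setting schedulerName to volcano; the exact mechanics vary by Volcano version.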
On Tue, Nov 30, 2021 at 9:21 AM Prasad Paravatha <prasad.parava...@gmail.com> wrote:

This is a great feature/idea.
I'd love to get involved in some form (testing and/or documentation). This could be my first contribution to Spark!

On Tue, Nov 30, 2021 at 10:46 PM John Zhuge <jzh...@apache.org> wrote:

+1. Kudos to Yikun and the community for starting the discussion!

On Tue, Nov 30, 2021 at 8:47 AM Chenya Zhang <chenyazhangche...@gmail.com> wrote:

Thanks, folks, for bringing up the topic of natively integrating Volcano and other alternative schedulers into Spark!

+Weiwei, Wilfred, Chaoran. We would love to contribute to the discussion as well.

From our side, we have been using and improving one alternative resource scheduler, Apache YuniKorn (https://yunikorn.apache.org/), for Spark on Kubernetes in production at Apple, with solid results over the past year. It is capable of supporting gang scheduling (similar to PodGroups), multi-tenant resource queues (similar to YARN), FIFO, and other handy features like bin packing to enable efficient autoscaling, etc.

Natively integrating with Spark would provide more flexibility for users and reduce the extra cost and potential inconsistency of maintaining different layers of resource strategies.
One interesting topic we hope to discuss more is dynamic allocation, which would benefit from native coordination between Spark and resource schedulers in K8s and cloud environments for optimal resource efficiency.

On Tue, Nov 30, 2021 at 8:10 AM Holden Karau <hol...@pigscanfly.ca> wrote:

Thanks for putting this together; I'm really excited for us to add better batch scheduling integrations.

On Tue, Nov 30, 2021 at 12:46 AM Yikun Jiang <yikunk...@gmail.com> wrote:

Hey everyone,

I'd like to start a discussion on the "Support Volcano/Alternative Schedulers Proposal".

This SPIP proposes to give Spark K8s schedulers more YARN-like features (such as queues and minimum resources before scheduling jobs) that many folks want on Kubernetes.

The goal of this SPIP is to improve the current Spark K8s scheduler implementation, add the ability to do batch scheduling, and support Volcano as one of the implementations.
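Concretely, enabling an alternative scheduler under this proposal is intended to come down to a handful of configuration properties. A hedged sketch of a spark-defaults.conf fragment (property names follow the SPIP discussion in this thread and were still under review at the time, so they may change; the executor-side feature step line and the queue value are assumptions for illustration):

```
# spark-defaults.conf sketch; names per the SPIP discussion, subject to change
spark.kubernetes.scheduler.name              volcano
spark.kubernetes.driver.pod.featureSteps     org.apache.spark.deploy.k8s.features.scheduler.VolcanoFeatureStep
spark.kubernetes.executor.pod.featureSteps   org.apache.spark.deploy.k8s.features.scheduler.VolcanoFeatureStep
spark.kubernetes.job.queue                   default
```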
Design doc: https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg
JIRA: https://issues.apache.org/jira/browse/SPARK-36057
Some of the PRs:
Ability to create resources: https://github.com/apache/spark/pull/34599
Add PodGroupFeatureStep: https://github.com/apache/spark/pull/34456

Regards,
Yikun

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

--
John Zhuge

--
Regards,
Prasad Paravatha