Do we want to move the SPIP forward to a vote? It seems like we're mostly agreeing in principle?
On Wed, Jan 5, 2022 at 11:12 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi Bo,
>
> Thanks for the info. Let me elaborate:
>
> In theory you can set the number of executors to a multiple of the number
> of nodes. For example, if you have a three-node k8s cluster (in my case
> Google GKE), you can set the number of executors to 6 and end up with six
> executors queuing to start, but you ultimately finish with two running
> executors plus the driver in a 3-node cluster, as shown below:
>
> hduser@ctpvm: /home/hduser> k get pods -n spark
> NAME                                         READY   STATUS    RESTARTS   AGE
> randomdatabigquery-d42d067e2b91c88a-exec-1   1/1     Running   0          33s
> randomdatabigquery-d42d067e2b91c88a-exec-2   1/1     Running   0          33s
> randomdatabigquery-d42d067e2b91c88a-exec-3   0/1     Pending   0          33s
> randomdatabigquery-d42d067e2b91c88a-exec-4   0/1     Pending   0          33s
> randomdatabigquery-d42d067e2b91c88a-exec-5   0/1     Pending   0          33s
> randomdatabigquery-d42d067e2b91c88a-exec-6   0/1     Pending   0          33s
> sparkbq-0beda77e2b919e01-driver              1/1     Running   0          45s
>
> A few seconds later the Pending executors have gone:
>
> hduser@ctpvm: /home/hduser> k get pods -n spark
> NAME                                         READY   STATUS    RESTARTS   AGE
> randomdatabigquery-d42d067e2b91c88a-exec-1   1/1     Running   0          40s
> randomdatabigquery-d42d067e2b91c88a-exec-2   1/1     Running   0          40s
> sparkbq-0beda77e2b919e01-driver              1/1     Running   0          52s
>
> So you end up with the Pending executors dropping out. Hence the
> conclusion seems to be that, with the current model, you want to fit
> exactly one Spark executor pod per Kubernetes node.
>
> HTH
>
> view my LinkedIn profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Wed, 5 Jan 2022 at 17:01, bo yang <bobyan...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Curious what you mean by "The constraint seems to be that you can fit one
>> Spark executor pod per Kubernetes node and from my tests you don't seem to
>> be able to allocate more than 50% of RAM on the node to the container".
>> Would you help to explain a bit? Asking because there could be multiple
>> executor pods running on a single Kubernetes node.
>>
>> Thanks,
>> Bo
>>
>> On Wed, Jan 5, 2022 at 1:13 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Thanks William for the info.
>>>
>>> The current model of Spark on k8s has certain drawbacks with pod-based
>>> scheduling, as I tested it on Google Kubernetes Engine (GKE). The
>>> constraint seems to be that you can fit one Spark executor pod per
>>> Kubernetes node, and from my tests you don't seem to be able to allocate
>>> more than 50% of the RAM on the node to the container.
>>>
>>> [image: gke_memoeyPlot.png]
>>>
>>> Asking for any more leaves the container never being created (stuck at
>>> Pending):
>>>
>>> kubectl describe pod sparkbq-b506ac7dc521b667-driver -n spark
>>>
>>> Events:
>>>   Type     Reason             Age                   From                Message
>>>   ----     ------             ---                   ----                -------
>>>   Warning  FailedScheduling   17m                   default-scheduler   0/3 nodes are available: 3 Insufficient memory.
>>>   Warning  FailedScheduling   17m                   default-scheduler   0/3 nodes are available: 3 Insufficient memory.
>>>   Normal   NotTriggerScaleUp  2m28s (x92 over 17m)  cluster-autoscaler  pod didn't trigger scale-up:
>>>
>>> Obviously this is far from ideal, and this model, although it works, is
>>> not efficient.
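As a rough sketch of the arithmetic behind that "Insufficient memory" output (illustrative numbers only, not taken from the thread; it assumes Spark's default ~10% executor memory overhead with a 384 MiB floor, and a node whose allocatable memory is well below its nominal size because of kubelet/system reservations):

```python
# Back-of-the-envelope fit check: an executor pod's memory request is
# spark.executor.memory plus the memory overhead, and it must fit within
# the node's *allocatable* memory, not the machine size.
def executor_pod_memory_request_mib(executor_memory_mib,
                                    overhead_factor=0.10,
                                    min_overhead_mib=384):
    """Pod memory request = executor memory + max(overhead, 384 MiB)."""
    overhead = max(int(executor_memory_mib * overhead_factor), min_overhead_mib)
    return executor_memory_mib + overhead

def executors_per_node(node_allocatable_mib, executor_memory_mib):
    """How many executor pods fit on one node, by memory alone."""
    return node_allocatable_mib // executor_pod_memory_request_mib(executor_memory_mib)

# Hypothetical node with ~12.5 GiB allocatable out of 16 GiB, and 8 GiB
# executors: only one executor fits per node, so a 6-executor job on 3
# such nodes leaves executors Pending, as in the kubectl output above.
print(executors_per_node(12800, 8192))  # 1 executor fits per node
```

The node sizes, overhead factor, and helper names here are assumptions for illustration; the actual request depends on spark.kubernetes.memoryOverheadFactor and the node's reported allocatable memory.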
>>>
>>> Cheers,
>>>
>>> Mich
>>>
>>> On Wed, 5 Jan 2022 at 03:55, William Wang <wang.platf...@gmail.com> wrote:
>>>
>>>> Hi Mich,
>>>>
>>>> Here are some of the performance indications for Volcano:
>>>> 1. Scheduler throughput: 1.5k pods/s (default scheduler: 100 pods/s).
>>>> 2. Spark application performance improved by 30%+ with the minimal
>>>>    resource reservation feature in the case of insufficient resources
>>>>    (tested with TPC-DS).
>>>>
>>>> We are still working on more optimizations. Besides performance, Volcano
>>>> is being continuously enhanced in the four directions below to provide
>>>> the abilities that users care about:
>>>> - Full lifecycle management for jobs
>>>> - Scheduling policies for high-performance workloads (fair-share,
>>>>   topology, SLA, reservation, preemption, backfill, etc.)
>>>> - Support for heterogeneous hardware
>>>> - Performance optimization for high-performance workloads
>>>>
>>>> Thanks,
>>>> LeiBo
>>>>
>>>> On Tue, Jan 4, 2022 at 18:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Interesting, thanks.
>>>>>
>>>>> Do you have any indication of a ballpark figure (a rough numerical
>>>>> estimate) of how much adding Volcano as an alternative scheduler is
>>>>> going to improve Spark on k8s performance?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Tue, 4 Jan 2022 at 09:43, Yikun Jiang <yikunk...@gmail.com> wrote:
>>>>>
>>>>>> Hi, folks! Wishing you all the best in 2022.
>>>>>>
>>>>>> I'd like to share the current status of "Support Customized K8S
>>>>>> Scheduler in Spark":
>>>>>>
>>>>>> https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg/edit#heading=h.1quyr1r2kr5n
>>>>>>
>>>>>> Framework/common support:
>>>>>>
>>>>>> - The Volcano and YuniKorn teams have joined the discussion and
>>>>>>   completed the initial doc on the framework/common part.
>>>>>>
>>>>>> - SPARK-37145 <https://issues.apache.org/jira/browse/SPARK-37145>
>>>>>>   (under review): We proposed to extend the customized scheduler by
>>>>>>   just using a custom feature step; it will meet the requirements of
>>>>>>   customized schedulers after it gets merged.
>>>>>>   After this, the user can enable the feature step and scheduler like:
>>>>>>
>>>>>>   spark-submit \
>>>>>>     --conf spark.kubernetes.scheduler.name=volcano \
>>>>>>     --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.scheduler.VolcanoFeatureStep \
>>>>>>     --conf spark.kubernetes.job.queue=xxx
>>>>>>
>>>>>>   (As above, the VolcanoFeatureStep will help to set the Spark
>>>>>>   scheduler queue according to the user-specified conf.)
>>>>>>
>>>>>> - SPARK-37331 <https://issues.apache.org/jira/browse/SPARK-37331>:
>>>>>>   Added the ability to create Kubernetes resources before driver pod
>>>>>>   creation.
>>>>>>
>>>>>> - SPARK-36059 <https://issues.apache.org/jira/browse/SPARK-36059>:
>>>>>>   Add the ability to specify a scheduler in the driver/executor.
>>>>>>
>>>>>> After all of the above, the framework/common support should be ready
>>>>>> for most customized schedulers.
>>>>>>
>>>>>> Volcano part:
>>>>>>
>>>>>> - SPARK-37258 <https://issues.apache.org/jira/browse/SPARK-37258>:
>>>>>>   Upgrade kubernetes-client to 5.11.1 to add Volcano scheduler API
>>>>>>   support.
>>>>>>
>>>>>> - SPARK-36061 <https://issues.apache.org/jira/browse/SPARK-36061>:
>>>>>>   Add a VolcanoFeatureStep to help users create a PodGroup with the
>>>>>>   user-specified minimum resources required; there is also a WIP commit
>>>>>>   to show a preview of this
>>>>>>   <https://github.com/Yikun/spark/pull/45/commits/81bf6f98edb5c00ebd0662dc172bc73f980b6a34>.
>>>>>>
>>>>>> YuniKorn part:
>>>>>>
>>>>>> - @WeiweiYang is completing the doc for the YuniKorn part and
>>>>>>   implementing it.
>>>>>>
>>>>>> Regards,
>>>>>> Yikun
>>>>>>
>>>>>> On Thu, Dec 2, 2021 at 02:00, Weiwei Yang <w...@apache.org> wrote:
>>>>>>
>>>>>>> Thank you Yikun for the info, and thanks for inviting me to a
>>>>>>> meeting to discuss this.
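To make the minimum-resources idea discussed above concrete, here is a toy sketch of gang admission (pure illustration, not Volcano or Spark code; the function and numbers are invented for the example): a job is admitted only if the cluster can hold all of its pods at once, which avoids partially-started jobs like the 2-of-6 executor case earlier in the thread.

```python
# Toy gang-admission check: admit the whole pod group or nothing.
def gang_admit(free_cpu, free_mem_gib, pods):
    """pods is a list of (cpu, mem_gib) requests; admit all-or-nothing."""
    need_cpu = sum(cpu for cpu, _ in pods)
    need_mem = sum(mem for _, mem in pods)
    return need_cpu <= free_cpu and need_mem <= free_mem_gib

# Hypothetical cluster with 12 CPUs / 48 GiB free; a driver (1 CPU, 4 GiB)
# plus 6 executors (2 CPU, 8 GiB each) needs 13 CPUs / 52 GiB, so the whole
# job waits -- rather than 5 executors starting and one pending forever.
job = [(1, 4)] + [(2, 8)] * 6
print(gang_admit(12, 48, job))  # False: hold the job instead of part-running it
```

A real scheduler also has to place pods on individual nodes, handle queues, and reserve resources; this only shows the all-or-nothing admission decision that a PodGroup's minimum resources enable.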
>>>>>>> I appreciate your effort to put these together, and I agree that the
>>>>>>> purpose is to make Spark easy/flexible enough to support other K8s
>>>>>>> schedulers (not just Volcano).
>>>>>>> As discussed, could you please help to abstract out the things in
>>>>>>> common and allow Spark to plug in different implementations? I'd be
>>>>>>> happy to work with you guys on this issue.
>>>>>>>
>>>>>>> On Tue, Nov 30, 2021 at 6:49 PM Yikun Jiang <yikunk...@gmail.com> wrote:
>>>>>>>
>>>>>>>> @Weiwei @Chenya
>>>>>>>>
>>>>>>>> > Thanks for bringing this up. This is quite interesting, we
>>>>>>>> > definitely should participate more in the discussions.
>>>>>>>>
>>>>>>>> Thanks for your reply, and welcome to the discussion; I think the
>>>>>>>> input from YuniKorn is critical.
>>>>>>>>
>>>>>>>> > The main thing here is, the Spark community should make Spark
>>>>>>>> > pluggable in order to support other schedulers, not just for
>>>>>>>> > Volcano. It looks like this proposal is pushing really hard for
>>>>>>>> > adopting PodGroup, which isn't part of K8s yet, that to me is
>>>>>>>> > problematic.
>>>>>>>>
>>>>>>>> Definitely yes, we are on the same page.
>>>>>>>>
>>>>>>>> I think we have the same goal: propose a general and reasonable
>>>>>>>> mechanism to make Spark on k8s with a custom scheduler more usable.
>>>>>>>>
>>>>>>>> As for PodGroup, allow me to give a brief introduction:
>>>>>>>> - The PodGroup definition has been approved officially by
>>>>>>>>   Kubernetes in KEP-583. [1]
>>>>>>>> - It can be regarded as a general concept/standard in Kubernetes
>>>>>>>>   rather than a Volcano-specific concept; there are other
>>>>>>>>   implementations of it as well, such as [2][3].
>>>>>>>> - Kubernetes recommends using CRDs for this kind of extension. [4]
>>>>>>>> - Volcano, as an extension, provides an interface to maintain the
>>>>>>>>   lifecycle of the PodGroup CRD and uses volcano-scheduler to
>>>>>>>>   complete the scheduling.
>>>>>>>>
>>>>>>>> [1] https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/583-coscheduling
>>>>>>>> [2] https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/coscheduling#podgroup
>>>>>>>> [3] https://github.com/kubernetes-sigs/kube-batch
>>>>>>>> [4] https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Yikun
>>>>>>>>
>>>>>>>> On Wed, Dec 1, 2021 at 05:57, Weiwei Yang <w...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hi Chenya,
>>>>>>>>>
>>>>>>>>> Thanks for bringing this up. This is quite interesting; we
>>>>>>>>> definitely should participate more in the discussions.
>>>>>>>>> The main thing here is that the Spark community should make Spark
>>>>>>>>> pluggable in order to support other schedulers, not just Volcano.
>>>>>>>>> It looks like this proposal is pushing really hard for adopting
>>>>>>>>> PodGroup, which isn't part of K8s yet; that, to me, is problematic.
>>>>>>>>>
>>>>>>>>> On Tue, Nov 30, 2021 at 9:21 AM Prasad Paravatha <prasad.parava...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> This is a great feature/idea.
>>>>>>>>>> I'd love to get involved in some form (testing and/or
>>>>>>>>>> documentation). This could be my first contribution to Spark!
>>>>>>>>>>
>>>>>>>>>> On Tue, Nov 30, 2021 at 10:46 PM John Zhuge <jzh...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 Kudos to Yikun and the community for starting the discussion!
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Nov 30, 2021 at 8:47 AM Chenya Zhang <chenyazhangche...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks folks for bringing up the topic of natively integrating
>>>>>>>>>>>> Volcano and other alternative schedulers into Spark!
>>>>>>>>>>>>
>>>>>>>>>>>> +Weiwei, Wilfred, Chaoran. We would love to contribute to the
>>>>>>>>>>>> discussion as well.
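For context on what adopting PodGroup means in practice, a minimal Volcano-style PodGroup manifest looks roughly like the sketch below (field names follow Volcano's scheduling.volcano.sh/v1beta1 API; the name, namespace, and resource figures are illustrative, not from the thread):

```yaml
# Gang scheduling via a PodGroup: the scheduler will not start any pod in
# the group until it can place at least `minMember` pods with the declared
# minimum resources, so a Spark job either gets its driver plus executors
# together or stays Pending.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-job-podgroup   # illustrative name
  namespace: spark
spec:
  minMember: 3               # e.g. driver + 2 executors scheduled as a unit
  minResources:
    cpu: "6"
    memory: "24Gi"
  queue: default
```

Spark pods then opt in by setting schedulerName to volcano and referencing the group via an annotation, which is roughly what the proposed VolcanoFeatureStep automates.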
>>>>>>>>>>>>
>>>>>>>>>>>> From our side, we have been using and improving one alternative
>>>>>>>>>>>> resource scheduler, Apache YuniKorn
>>>>>>>>>>>> (https://yunikorn.apache.org/), for Spark on Kubernetes in
>>>>>>>>>>>> production at Apple, with solid results over the past year. It
>>>>>>>>>>>> is capable of supporting gang scheduling (similar to PodGroups),
>>>>>>>>>>>> multi-tenant resource queues (similar to YARN), FIFO, and other
>>>>>>>>>>>> handy features like bin packing to enable efficient autoscaling,
>>>>>>>>>>>> etc.
>>>>>>>>>>>>
>>>>>>>>>>>> Natively integrating with Spark would provide more flexibility
>>>>>>>>>>>> for users and reduce the extra cost and potential inconsistency
>>>>>>>>>>>> of maintaining different layers of resource strategies. One
>>>>>>>>>>>> interesting topic we hope to discuss more is dynamic allocation,
>>>>>>>>>>>> which would benefit from native coordination between Spark and
>>>>>>>>>>>> resource schedulers in K8s and cloud environments for optimal
>>>>>>>>>>>> resource efficiency.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Nov 30, 2021 at 8:10 AM Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for putting this together, I'm really excited for us to
>>>>>>>>>>>>> add better batch scheduling integrations.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Nov 30, 2021 at 12:46 AM Yikun Jiang <yikunk...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'd like to start a discussion on the "Support
>>>>>>>>>>>>>> Volcano/Alternative Schedulers Proposal".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This SPIP is proposed to make Spark's k8s schedulers provide
>>>>>>>>>>>>>> more YARN-like features (such as queues and minimum resources
>>>>>>>>>>>>>> before scheduling jobs) that many folks want on Kubernetes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The goal of this SPIP is to improve the current Spark k8s
>>>>>>>>>>>>>> scheduler implementations, add the ability to do batch
>>>>>>>>>>>>>> scheduling, and support Volcano as one of the implementations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Design doc: https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg
>>>>>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-36057
>>>>>>>>>>>>>> Some of the PRs:
>>>>>>>>>>>>>> - Ability to create resources: https://github.com/apache/spark/pull/34599
>>>>>>>>>>>>>> - Add PodGroupFeatureStep: https://github.com/apache/spark/pull/34456
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Yikun
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> John Zhuge
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Regards,
>>>>>>>>>> Prasad Paravatha

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau