Thanks for the reply. Have you tested in environments where O is bigger than H? I'm wondering whether the proposed algorithm would help more in those environments (e.g., with native accelerators).
On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]> wrote:

> Hi Qiegang, thanks for the good questions as well. Please see the answers below.
>
>> My initial understanding is that Kubernetes will use the Executor Memory Request (H + G) for scheduling decisions, which allows for better resource packing.
>
> Yes, your understanding is correct.
>
>> How is the risk of host-level OOM mitigated when the total potential usage sum of H+G+S across all pods on a node exceeds its allocatable capacity? Does the proposal implicitly rely on the cluster operator to manually ensure an unrequested memory buffer exists on the node to serve as the shared pool?
>
> In PINS we apply a set of strategies: a conservative bursty factor, a progressive rollout, and monitoring of cluster metrics such as Linux kernel OOM-killer occurrences, which guide us toward the optimal bursty-factor setting. Usually, K8s operators reserve space on each host for daemon processes; we found that sufficient in our case, and our major tuning effort focuses on the bursty-factor value.
>
>> Have you considered scheduling optimizations to ensure a strategic mix of executors with large S and small S values on a single node? I am wondering if this would reduce the probability of concurrent bursting and host-level OOM.
>
> Yes, while working on this project we paid attention to the cluster's scheduling policy and behavior. Two things we care about most:
>
> 1. As stated in the SPIP doc, the cluster should have a certain level of workload diversity so that there are enough candidates to form a mixed set of executors with large and small S values.
> 2. We avoid bin-packing scheduling algorithms, which tend to place more pods from the same job on the same host; such pods are more likely to ask for their maximum memory at the same time, which can cause trouble.
>
> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:
>
>> Thanks for sharing this interesting proposal.
>>
>> My initial understanding is that Kubernetes will use the Executor Memory Request (H + G) for scheduling decisions, which allows for better resource packing. I have a few questions regarding the shared portion S:
>>
>> 1. How is the risk of host-level OOM mitigated when the total potential usage sum of H+G+S across all pods on a node exceeds its allocatable capacity? Does the proposal implicitly rely on the cluster operator to manually ensure an unrequested memory buffer exists on the node to serve as the shared pool?
>> 2. Have you considered scheduling optimizations to ensure a strategic mix of executors with large S and small S values on a single node? I am wondering if this would reduce the probability of concurrent bursting and host-level OOM.
>>
>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]> wrote:
>>
>>> I think I'm still missing something in the big picture:
>>>
>>> - Is the memory overhead off-heap? The formula indicates a fixed heap size, and memory overhead can't be dynamic if it's on-heap.
>>> - Do Spark applications have static profiles? When we submit stages, the cluster is already allocated, so how can we change anything?
>>> - How do we assign the shared memory overhead? Fairly among all applications on the same physical node?
>>>
>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:
>>>
>>>> We didn't separate the design into another doc since the main idea is relatively simple.
>>>> For the request/limit calculation, I described it in Q4 of the SPIP doc:
>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>
>>>> It is calculated per resource profile (you could say per stage): when the cluster manager composes the pod spec, it calculates the new memory overhead based on what the user asks for in that resource profile.
>>>>
>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]> wrote:
>>>>
>>>>> Do we have a design sketch? How do we determine the memory request and limit? Is it per stage or per executor?
>>>>>
>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]> wrote:
>>>>>
>>>>>> Yeah, the implementation basically relies on the request/limit concept in K8s.
>>>>>>
>>>>>> But if any other cluster manager comes along in the future, it can leverage this easily as long as it has a similar concept, since the main logic is implemented in ResourceProfile.
>>>>>>
>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]> wrote:
>>>>>>
>>>>>>> This feature is only available on K8s because it allows containers to have dynamic resources?
>>>>>>>
>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi folks,
>>>>>>>>
>>>>>>>> We are proposing a burst-aware memoryOverhead allocation algorithm for Spark@K8S to improve the memory utilization of Spark clusters. Please see more details in the SPIP doc
>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>>>> Feedback and discussion are welcome.
>>>>>>>>
>>>>>>>> Thanks to Chao for being the shepherd of this feature. We also want to thank the authors of the original paper
>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf>
>>>>>>>> from ByteDance, specifically Rui ([email protected]) and Yixin ([email protected]).
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>> Yao Wang
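
[Editor's note] To make the request/limit split discussed in the thread above concrete, here is a minimal, self-contained Scala sketch of one way the configured memory overhead could be divided into a guaranteed slice (G) and a shared, bursty slice (S). The burstyFactor knob, the split helper, and the division formula are illustrative assumptions, not the formula defined in Q4 of the SPIP doc; only the overall shape, request = H + G and limit = H + G + S, is taken from the discussion.

    object BurstAwareOverheadSketch {

      /** Memory request/limit pair for one executor pod, in MiB. */
      final case class OverheadSplit(requestMiB: Long, limitMiB: Long)

      /**
       * heapMiB      - H: executor heap (spark.executor.memory)
       * overheadMiB  - O: configured memory overhead (spark.executor.memoryOverhead)
       * burstyFactor - assumed knob for how much of O is treated as bursty;
       *                1.0 means everything is guaranteed (today's behavior)
       */
      def split(heapMiB: Long, overheadMiB: Long, burstyFactor: Double): OverheadSplit = {
        require(burstyFactor >= 1.0, "bursty factor must be >= 1.0")
        // G: guaranteed slice of the overhead, always counted in the pod request
        val guaranteedMiB = math.ceil(overheadMiB / burstyFactor).toLong
        // S: shared/bursty slice, reserved only through the pod limit
        val sharedMiB = overheadMiB - guaranteedMiB
        OverheadSplit(
          requestMiB = heapMiB + guaranteedMiB,             // H + G: what the scheduler packs on
          limitMiB = heapMiB + guaranteedMiB + sharedMiB    // H + G + S: hard cap per pod
        )
      }

      def main(args: Array[String]): Unit = {
        // Example: 8 GiB heap, 4 GiB overhead, bursty factor 2.0
        // -> request = 8192 + 2048 = 10240 MiB, limit = 12288 MiB
        val s = split(heapMiB = 8192L, overheadMiB = 4096L, burstyFactor = 2.0)
        println(s"request=${s.requestMiB}Mi limit=${s.limitMiB}Mi")
      }
    }

With these example numbers, the scheduler packs the node as if the executor needs 10 GiB, while the pod may still burst into another 2 GiB of shared headroom up to its 12 GiB limit.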

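[Editor's note] For completeness, a sketch of where such a pair could land in the executor container spec built for K8s (Spark's K8s backend uses the fabric8 client). The builder calls below are standard fabric8 APIs, but this is only an illustration of the request/limit mapping under the assumptions above, not the SPIP's actual integration point, and the container name is just the conventional executor container name.

    import io.fabric8.kubernetes.api.model.{Container, ContainerBuilder, Quantity}

    object ExecutorContainerSketch {

      // Attach the burst-aware memory figures to an executor container:
      // the scheduler packs nodes on the request, the cgroup cap is the limit.
      def withBurstAwareMemory(requestMiB: Long, limitMiB: Long): Container =
        new ContainerBuilder()
          .withName("spark-kubernetes-executor") // illustrative container name
          .withNewResources()
            .addToRequests("memory", new Quantity(s"${requestMiB}Mi")) // H + G
            .addToLimits("memory", new Quantity(s"${limitMiB}Mi"))     // H + G + S
          .endResources()
          .build()
    }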