Hi Qiegang, thanks for the good questions as well.

Please see my answers below.

> My initial understanding is that Kubernetes will use the Executor Memory
Request (H + G) for scheduling decisions, which allows for better resource
packing.

Yes, your understanding is correct.

> How is the risk of host-level OOM mitigated when the total potential
usage  sum of H+G+S across all pods on a node exceeds its allocatable
capacity? Does the proposal implicitly rely on the cluster operator to
manually ensure an unrequested memory buffer exists on the node to serve as
the shared pool?

In PINS, we apply a set of strategies: setting a conservative bursty factor, rolling the feature out progressively, and monitoring cluster metrics such as Linux kernel OOM-killer occurrences to guide us toward the optimal bursty factor setting. Usually K8S operators already reserve space on each host for daemon processes; we found that to be sufficient in our case, and our main tuning effort goes into the bursty factor value. A rough sketch of how the bursty factor feeds into the request/limit split is below.
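
Purely as an illustration (not taken from the SPIP doc, whose exact formula is in Q4): a minimal Scala sketch of the request/limit split, under the simplifying assumption that the shared/burstable portion S is burstyFactor times the baseline memoryOverhead.

    // Illustrative only: assumes S = burstyFactor * baseOverhead.
    // The real computation lives in ResourceProfile / the SPIP's Q4 and may differ.
    object BurstyMemorySketch {
      final case class PodMemory(requestMiB: Long, limitMiB: Long)

      def podMemory(heapMiB: Long,          // H: executor heap
                    baseOverheadMiB: Long,  // G: guaranteed memoryOverhead
                    burstyFactor: Double    // tunable; start conservative
                   ): PodMemory = {
        val sharedMiB = (baseOverheadMiB * burstyFactor).toLong  // S: burstable portion
        PodMemory(
          requestMiB = heapMiB + baseOverheadMiB,             // K8S schedules on H + G
          limitMiB   = heapMiB + baseOverheadMiB + sharedMiB  // pod may burst up to H + G + S
        )
      }

      def main(args: Array[String]): Unit = {
        // e.g. 8 GiB heap, 1 GiB guaranteed overhead, bursty factor 0.5
        println(podMemory(heapMiB = 8192, baseOverheadMiB = 1024, burstyFactor = 0.5))
        // PodMemory(9216, 9728)
      }
    }

Tuning the bursty factor is then a matter of watching how often the limits are actually hit (and whether the kernel OOM-killer fires) and adjusting the factor accordingly.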


> Have you considered scheduling optimizations to ensure a strategic mix of
executors with large S and small S values on a single node?  I am wondering
if this would reduce the probability of concurrent bursting and host-level
OOM.

Yes, while working on this project we paid attention to the cluster scheduling policy/behavior. There are two things we care about most:

1. as stated in the SPIP doc, the cluster should have a certain level of workload diversity so that there are enough candidates to form a mixed set of executors with large and small S values

2. we avoid bin-packing scheduling algorithms, which tend to pack more pods from the same job onto the same host; this can cause trouble because those pods are more likely to ask for their max memory at the same time (see the sketch below for a rough worst-case comparison)
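
To make point 2 concrete, here is a rough, illustrative Scala sketch (all names and numbers are made up, not from the SPIP) comparing the worst-case exposure of a node when every pod comes from the same job (large S everywhere) versus a mixed set of large-S and small-S executors.

    // Back-of-the-envelope check: if all pods on a host burst at once,
    // worst-case usage approaches the sum of their limits.
    object HostBurstExposure {
      final case class Pod(requestMiB: Long, limitMiB: Long)

      // Fraction of allocatable memory needed if every pod hits its limit.
      def worstCaseRatio(pods: Seq[Pod], allocatableMiB: Long): Double =
        pods.map(_.limitMiB).sum.toDouble / allocatableMiB

      def main(args: Array[String]): Unit = {
        val allocatable = 72L * 1024 // 72 GiB allocatable on the node
        // Same job bin-packed: 6 executors, each with a large S (4 GiB burst)
        val sameJob = Seq.fill(6)(Pod(requestMiB = 9216, limitMiB = 13312))
        // Mixed workloads: 3 large-S executors + 3 small-S executors (0.5 GiB burst)
        val mixed = Seq.fill(3)(Pod(9216, 13312)) ++ Seq.fill(3)(Pod(9216, 9728))
        println(f"same-job bin-packing: ${worstCaseRatio(sameJob, allocatable)}%.2f") // > 1.0
        println(f"mixed S values:       ${worstCaseRatio(mixed, allocatable)}%.2f")   // < 1.0
      }
    }

Both placements fit on the node by request, but only the mixed set keeps the worst-case (all pods bursting simultaneously) within the node's allocatable memory, which is why we prefer spreading over bin-packing here.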



On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:

> Thanks for sharing this interesting proposal.
>
> My initial understanding is that Kubernetes will use the Executor Memory
> Request (H + G) for scheduling decisions, which allows for better
> resource packing.  I have a few questions regarding the shared portion S:
>
>    1. How is the risk of host-level OOM mitigated when the total
>    potential usage  sum of H+G+S across all pods on a node exceeds its
>    allocatable capacity? Does the proposal implicitly rely on the cluster
>    operator to manually ensure an unrequested memory buffer exists on the node
>    to serve as the shared pool?
>    2. Have you considered scheduling optimizations to ensure a strategic
>    mix of executors with large S and small S values on a single node?  I
>    am wondering if this would reduce the probability of concurrent bursting
>    and host-level OOM.
>
>
> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]> wrote:
>
>> I think I'm still missing something in the big picture:
>>
>>    - Is the memory overhead off-heap? The formula indicates a fixed
>>    heap size, and memory overhead can't be dynamic if it's on-heap.
>>    - Do Spark applications have static profiles? When we submit stages,
>>    the cluster is already allocated, how can we change anything?
>>    - How do we assign the shared memory overhead? Fairly among all
>>    applications on the same physical node?
>>
>>
>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:
>>
>>> we didn't separate the design into another doc since the main idea is
>>> relatively simple...
>>>
>>> for request/limit calculation, I described it in Q4 of the SPIP doc
>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>
>>> it is calculated based on per profile (you can say it is based on per
>>> stage), when the cluster manager compose the pod spec, it calculates the
>>> new memory overhead based on what user asks for in that resource profile
>>>
>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]> wrote:
>>>
>>>> Do we have a design sketch? How to determine the memory request and
>>>> limit? Is it per stage or per executor?
>>>>
>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]> wrote:
>>>>
>>>>> yeah, the implementation is basically relying on the request/limit
>>>>> concept in K8S, ...
>>>>>
>>>>> but if there is any other cluster manager coming in future,  as long
>>>>> as it has a similar concept , it can leverage this easily as the main 
>>>>> logic
>>>>> is implemented in ResourceProfile
>>>>>
>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> This feature is only available on k8s because it allows containers to
>>>>>> have dynamic resources?
>>>>>>
>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Folks,
>>>>>>>
>>>>>>> We are proposing a burst-aware memoryOverhead allocation algorithm
>>>>>>> for Spark@K8S to improve memory utilization of spark clusters.
>>>>>>> Please see more details in SPIP doc
>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>>> Feedbacks and discussions are welcomed.
>>>>>>>
>>>>>>> Thanks Chao for being the shepherd of this feature.
>>>>>>> Also want to thank the authors of the original paper
>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from ByteDance,
>>>>>>> specifically Rui([email protected]) and Yixin(
>>>>>>> [email protected]).
>>>>>>>
>>>>>>> Thank you.
>>>>>>> Yao Wang
>>>>>>>
>>>>>>
