this is a good question

> a stage is bursty and consumes the shared portion and fails to release
> it for subsequent stages

In the scenario you described, since the memory-leaking stage and the
subsequent ones belong to the same job, the pod will likely be killed by
the cgroup OOM killer. Take the following usage pattern as an example:
G = 5GB, S = 2GB, so the pod uses G + S = 7G at peak. In theory it should
release all 7G and then claim 7G again in some later stage. However, due
to the leak it holds 2G forever and then asks for another 7G; as a result
it hits the pod memory limit, and the cgroup OOM killer terminates the
pod. So this should be safe for the system.

However, we should definitely be careful about such memory peaks, because
they essentially break the assumption that memoryOverhead usage is bursty
(a permanent memory peak ~= memory held forever)... Unfortunately,
shared/guaranteed memory is managed by user applications instead of at
the cluster level; they, especially S, are just logical concepts rather
than a physical memory pool that pods can explicitly claim memory from...

On Thu, Dec 11, 2025 at 10:17 PM karuppayya <[email protected]> wrote:

> Thanks for the interesting proposal.
> The design seems to rely on memoryOverhead being transient.
> What happens when a stage is bursty and consumes the shared portion and
> fails to release it for subsequent stages (e.g., off-heap buffers that
> are not garbage collected since they are off-heap)? Would this trigger
> the host-level OOM described in Q6? Or are there strategies to release
> the shared portion?
>
> On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]> wrote:
>
>> yes, that's the worst case in this scenario; please check my earlier
>> response to Qiegang's question. We have a set of strategies adopted in
>> prod to mitigate the issue.
>>
>> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <[email protected]> wrote:
>>
>>> Thanks for the explanation! So the executor is not guaranteed to get
>>> 50 GB of physical memory, right?
>>> All pods on the same host may reach peak memory usage at the same
>>> time and cause paging/swapping, which hurts performance?
>>>
>>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <[email protected]> wrote:
>>>
>>>> np, let me try to explain.
>>>>
>>>> 1. Each executor container runs in a pod together with some other
>>>> sidecar containers taking care of tasks like authentication, etc.
>>>> For simplicity, we assume each pod has only one container, which is
>>>> the executor container.
>>>>
>>>> 2. Each container is assigned two values, *request* and *limit*
>>>> (limit >= request), for both CPU and memory resources (we only
>>>> discuss memory here). Each pod's request/limit values are the sums
>>>> of those of all containers belonging to the pod.
>>>>
>>>> 3. The K8S scheduler chooses a machine to host a pod based on its
>>>> *request* value, and caps the resource usage of each container based
>>>> on its *limit* value. E.g., if I have a pod with a single container
>>>> that has 1G/2G as its request and limit values respectively, any
>>>> machine with 1G of free RAM is a candidate to host this pod, and
>>>> when the container uses more than 2G of memory, it will be killed by
>>>> the cgroup OOM killer. Once a pod is scheduled to a host, a memory
>>>> space sized at the sum of all its containers' request values is
>>>> booked exclusively for this pod.
>>>>
>>>> 4. By default, Spark *sets request and limit to the same value for
>>>> executors on k8s*, and this value is basically spark.executor.memory
>>>> + spark.executor.memoryOverhead in most cases. However,
>>>> spark.executor.memoryOverhead usage is very bursty: a user setting
>>>> spark.executor.memoryOverhead to 10G usually means each executor
>>>> needs the full 10G for only a very small portion of its lifecycle.
>>>>
>>>> 5. The proposed SPIP essentially decouples the request and limit
>>>> values for executors in spark@k8s in a safe way (this idea comes
>>>> from the ByteDance paper we refer to in the SPIP doc).
>>>>
>>>> Using the aforementioned example:
>>>>
>>>> if we have a single-node cluster with 100G of RAM and two pods each
>>>> requesting 40G + 10G (on-heap + memoryOverhead), and we set the
>>>> bursty factor to 1.2, then without the mechanism proposed in this
>>>> SPIP we can host at most 2 pods on this machine, and because of the
>>>> bursty usage of that 10G space, memory utilization would be
>>>> compromised.
>>>>
>>>> When applying burst-aware memory allocation, we only need
>>>> 40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. we
>>>> have 20G of free memory left on the machine which can be used to
>>>> host some smaller pods. At the same time, since we didn't change the
>>>> limit value of the executor pods, these executors can still use 50G
>>>> at max.
>>>>
>>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <[email protected]>
>>>> wrote:
>>>>
>>>>> Sorry, I'm not very familiar with the k8s infra; how does it work
>>>>> under the hood? Will the container adjust its system memory size
>>>>> depending on the actual memory usage of the processes in it?
>>>>>
>>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]> wrote:
>>>>>
>>>>>> yeah, we have a few cases where O is significantly larger than H;
>>>>>> the proposed algorithm is actually a great fit.
>>>>>>
>>>>>> As I explained in Appendix C of the SPIP doc, the proposed
>>>>>> algorithm will allocate a non-trivial G to ensure safe execution,
>>>>>> but still carve out a big chunk of memory (10s of GBs) and treat
>>>>>> it as S, saving tons of money.
>>>>>>
>>>>>> But regarding native accelerators, some native acceleration
>>>>>> engines do not use memoryOverhead but instead use off-heap memory
>>>>>> (spark.memory.offHeap.size) explicitly (e.g. Gluten).
>>>>>> The current implementation does not cover this part yet, though
>>>>>> that will be an easy extension.
>>>>>>
>>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the reply.
>>>>>>>
>>>>>>> Have you tested in environments where O is bigger than H?
>>>>>>> Wondering if the proposed algorithm would help more in those
>>>>>>> environments (e.g. with native accelerators)?
>>>>>>>
>>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Qiegang, thanks for the good questions as well.
>>>>>>>>
>>>>>>>> Please check the following answers:
>>>>>>>>
>>>>>>>> > My initial understanding is that Kubernetes will use the
>>>>>>>> Executor Memory Request (H + G) for scheduling decisions, which
>>>>>>>> allows for better resource packing.
>>>>>>>>
>>>>>>>> Yes, your understanding is correct.
>>>>>>>>
>>>>>>>> > How is the risk of host-level OOM mitigated when the total
>>>>>>>> potential usage sum of H+G+S across all pods on a node exceeds
>>>>>>>> its allocatable capacity? Does the proposal implicitly rely on
>>>>>>>> the cluster operator to manually ensure an unrequested memory
>>>>>>>> buffer exists on the node to serve as the shared pool?
>>>>>>>>
>>>>>>>> In PINS, we basically apply a set of strategies: setting a
>>>>>>>> conservative bursty factor, progressive rollout, and monitoring
>>>>>>>> cluster metrics such as Linux kernel OOM killer occurrences to
>>>>>>>> guide us to the optimal bursty-factor setup. Usually, K8S
>>>>>>>> operators set a reserved space for daemon processes on each
>>>>>>>> host; we found it sufficient in our case, and our tuning mainly
>>>>>>>> focuses on the bursty factor value.
>>>>>>>>
>>>>>>>> > Have you considered scheduling optimizations to ensure a
>>>>>>>> strategic mix of executors with large S and small S values on a
>>>>>>>> single node? I am wondering if this would reduce the probability
>>>>>>>> of concurrent bursting and host-level OOM.
>>>>>>>>
>>>>>>>> Yes, when we worked on this project we paid some attention to
>>>>>>>> the cluster scheduling policy/behavior. Two things we mostly
>>>>>>>> care about:
>>>>>>>>
>>>>>>>> 1. As stated in the SPIP doc, the cluster should have a certain
>>>>>>>> level of workload diversity so that we have enough candidates to
>>>>>>>> form a mixed set of executors with large S and small S values.
>>>>>>>>
>>>>>>>> 2. We avoid using a binpack scheduling algorithm, which tends to
>>>>>>>> pack more pods from the same job onto the same host; this can
>>>>>>>> create trouble since those pods are more likely to ask for max
>>>>>>>> memory at the same time.
>>>>>>>>
>>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for sharing this interesting proposal.
>>>>>>>>>
>>>>>>>>> My initial understanding is that Kubernetes will use the
>>>>>>>>> Executor Memory Request (H + G) for scheduling decisions, which
>>>>>>>>> allows for better resource packing. I have a few questions
>>>>>>>>> regarding the shared portion S:
>>>>>>>>>
>>>>>>>>> 1. How is the risk of host-level OOM mitigated when the total
>>>>>>>>> potential usage sum of H+G+S across all pods on a node exceeds
>>>>>>>>> its allocatable capacity? Does the proposal implicitly rely on
>>>>>>>>> the cluster operator to manually ensure an unrequested memory
>>>>>>>>> buffer exists on the node to serve as the shared pool?
>>>>>>>>>
>>>>>>>>> 2. Have you considered scheduling optimizations to ensure a
>>>>>>>>> strategic mix of executors with large S and small S values on a
>>>>>>>>> single node? I am wondering if this would reduce the
>>>>>>>>> probability of concurrent bursting and host-level OOM.
>>>>>>>>>
>>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I think I'm still missing something in the big picture:
>>>>>>>>>>
>>>>>>>>>> - Is the memory overhead off-heap? The formula indicates a
>>>>>>>>>> fixed heap size, and memory overhead can't be dynamic if it's
>>>>>>>>>> on-heap.
>>>>>>>>>> - Do Spark applications have static profiles? When we submit
>>>>>>>>>> stages, the cluster is already allocated; how can we change
>>>>>>>>>> anything?
>>>>>>>>>> - How do we assign the shared memory overhead? Fairly among
>>>>>>>>>> all applications on the same physical node?
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> we didn't separate the design into another doc since the main
>>>>>>>>>>> idea is relatively simple...
>>>>>>>>>>>
>>>>>>>>>>> For the request/limit calculation, I described it in Q4 of
>>>>>>>>>>> the SPIP doc:
>>>>>>>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>>>>>>>
>>>>>>>>>>> It is calculated per resource profile (you could say per
>>>>>>>>>>> stage): when the cluster manager composes the pod spec, it
>>>>>>>>>>> calculates the new memory overhead based on what the user
>>>>>>>>>>> asks for in that resource profile.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Do we have a design sketch? How do we determine the memory
>>>>>>>>>>>> request and limit? Is it per stage or per executor?
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> yeah, the implementation basically relies on the
>>>>>>>>>>>>> request/limit concept in K8S, ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> but if any other cluster manager comes along in the future,
>>>>>>>>>>>>> as long as it has a similar concept it can leverage this
>>>>>>>>>>>>> easily, since the main logic is implemented in
>>>>>>>>>>>>> ResourceProfile.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> This feature is only available on k8s because it allows
>>>>>>>>>>>>>> containers to have dynamic resources?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Folks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead allocation
>>>>>>>>>>>>>>> algorithm for Spark@K8S to improve the memory utilization
>>>>>>>>>>>>>>> of Spark clusters. Please see more details in the SPIP
>>>>>>>>>>>>>>> doc
>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>>>>>>>>>>> Feedback and discussion are welcome.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks to Chao for being the shepherd of this feature.
>>>>>>>>>>>>>>> We also want to thank the authors of the original paper
>>>>>>>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from
>>>>>>>>>>>>>>> ByteDance, specifically Rui ([email protected]) and Yixin
>>>>>>>>>>>>>>> ([email protected]).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>> Yao Wang
