Thanks for the interesting proposal. The design seems to rely on memoryOverhead usage being transient. What happens when a stage is bursty, consumes the shared portion, and fails to release it for subsequent stages (e.g., off-heap buffers that are not garbage collected since they live off-heap)? Would this trigger the host-level OOM described in Q6, or are there strategies to release the shared portion?
On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]> wrote:

> yes, that's the worst case in that scenario. Please check my earlier response to Qiegang's question; we have a set of strategies adopted in prod to mitigate the issue.
>
> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <[email protected]> wrote:
>
>> Thanks for the explanation! So the executor is not guaranteed to get 50 GB of physical memory, right? All pods on the same host may reach peak memory usage at the same time and cause paging/swapping, which hurts performance?
>>
>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <[email protected]> wrote:
>>
>>> np, let me try to explain
>>>
>>> 1. Each executor container runs in a pod together with some other sidecar containers taking care of tasks like authentication, etc. For simplicity, we assume each pod has only one container, which is the executor container.
>>>
>>> 2. Each container is assigned two values, *request* and *limit* (limit >= request), for both CPU and memory resources (we only discuss memory here). A pod's request/limit values are the sums over all containers belonging to that pod.
>>>
>>> 3. The K8S scheduler chooses a machine to host a pod based on the *request* value, and caps the resource usage of each container based on its *limit* value. E.g., if I have a pod with a single container that has 1G/2G as its request and limit respectively, any machine with 1G of free RAM is a candidate to host this pod, and when the container uses more than 2G of memory, it will be killed by the cgroup OOM killer. Once a pod is scheduled to a host, memory sized at the "sum of all its containers' request values" is booked exclusively for this pod.
>>>
>>> 4. By default, Spark *sets request and limit to the same value for executors on k8s*, and this value is basically spark.executor.memory + spark.executor.memoryOverhead in most cases. However, spark.executor.memoryOverhead usage is very bursty: a user setting spark.executor.memoryOverhead to 10G usually means each executor only needs the full 10G for a very small portion of its lifecycle.
>>>
>>> 5. The proposed SPIP essentially decouples the request/limit values for executors in spark@k8s in a safe way (this idea is from the ByteDance paper we refer to in the SPIP doc).
>>>
>>> Using the aforementioned example:
>>>
>>> if we have a single-node cluster with 100G of RAM, two pods each requesting 40G + 10G (on-heap + memoryOverhead), and a bursty factor of 1.2, then without the mechanism proposed in this SPIP we can host at most 2 such pods on this machine, and because of the bursty usage of that 10G space, memory utilization is compromised.
>>>
>>> When applying the burst-aware memory allocation, we only need 40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. 20G of free memory is left on the machine, which can be used to host some smaller pods. At the same time, since we didn't change the limit value of the executor pods, these executors can still use 50G at max.
>>>
>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <[email protected]> wrote:
>>>
>>>> Sorry, I'm not very familiar with the k8s infra; how does it work under the hood? Does the container adjust its memory size depending on the actual memory usage of the processes in it?
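
To make sure I'm reading the arithmetic above correctly, here is a quick Scala sketch of the request/limit calculation as I understand it from the example (the names, and the way I turned the bursty factor 1.2 into a 0.2 fraction, are my own reading; the authoritative formula is in the SPIP doc):

    // Sketch only, not the SPIP implementation.
    object BurstAwareRequestSketch {
      // Returns (request, limit) in GB for one executor pod.
      def requestAndLimitGb(heapGb: Double, overheadGb: Double, burstFraction: Double): (Double, Double) = {
        val limitGb = heapGb + overheadGb                             // unchanged: the executor can still burst to H + O
        val sharedGb = math.min(limitGb * burstFraction, overheadGb)  // the portion treated as bursty/shared
        (limitGb - sharedGb, limitGb)                                 // the request is what the K8S scheduler books exclusively
      }

      def main(args: Array[String]): Unit = {
        // The example from the thread: 40G on-heap + 10G memoryOverhead, bursty factor 1.2
        val (request, limit) = requestAndLimitGb(heapGb = 40, overheadGb = 10, burstFraction = 0.2)
        println(s"request = ${request}G, limit = ${limit}G")          // prints: request = 40.0G, limit = 50.0G
      }
    }

With two such pods requested (2 x 40G), a 100G machine still has 20G unbooked for smaller pods, while each executor keeps its 50G ceiling, which matches the numbers above.
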
>>>>
>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]> wrote:
>>>>
>>>>> yeah, we have a few cases where O is significantly larger than H; the proposed algorithm is actually a great fit there.
>>>>>
>>>>> as I explained in Appendix C of the SPIP doc, the proposed algorithm will allocate a non-trivial G to ensure safe execution but still cut a big chunk of memory (10s of GBs) and treat it as S, saving tons of money burnt on it.
>>>>>
>>>>> but regarding native accelerators, some native acceleration engines do not use memoryOverhead but use off-heap memory (spark.memory.offHeap.size) explicitly (e.g. Gluten). The current implementation does not cover this part, but that would be an easy extension.
>>>>>
>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]> wrote:
>>>>>
>>>>>> Thanks for the reply.
>>>>>>
>>>>>> Have you tested in environments where O is bigger than H? Wondering if the proposed algorithm would help more in those environments (e.g. with native accelerators)?
>>>>>>
>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Qiegang, thanks for the good questions as well.
>>>>>>>
>>>>>>> please check the following answers
>>>>>>>
>>>>>>> > My initial understanding is that Kubernetes will use the Executor Memory Request (H + G) for scheduling decisions, which allows for better resource packing.
>>>>>>>
>>>>>>> yes, your understanding is correct
>>>>>>>
>>>>>>> > How is the risk of host-level OOM mitigated when the total potential usage sum of H+G+S across all pods on a node exceeds its allocatable capacity? Does the proposal implicitly rely on the cluster operator to manually ensure an unrequested memory buffer exists on the node to serve as the shared pool?
>>>>>>>
>>>>>>> in PINS, we basically apply a set of strategies: setting a conservative bursty factor, progressive rollout, and monitoring cluster metrics like Linux kernel OOM killer occurrences to guide us to the optimal bursty factor setup... Usually, K8S operators set aside reserved space for daemon processes on each host; we found that to be sufficient in our case, and our major tuning focuses on the bursty factor value.
>>>>>>>
>>>>>>> > Have you considered scheduling optimizations to ensure a strategic mix of executors with large S and small S values on a single node? I am wondering if this would reduce the probability of concurrent bursting and host-level OOM.
>>>>>>>
>>>>>>> Yes, when we worked on this project, we paid some attention to the cluster scheduling policy/behavior... two things we mostly care about:
>>>>>>>
>>>>>>> 1. as stated in the SPIP doc, the cluster should have a certain level of workload diversity so that we have enough candidates to form a mixed set of executors with large S and small S values
>>>>>>>
>>>>>>> 2. we avoid using a binpack scheduling algorithm, which tends to pack more pods from the same job onto the same host; that can create trouble, as those pods are more likely to ask for max memory at the same time
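
On the O > H case discussed above: to check my understanding of how O gets split into G and S, here is a rough Scala sketch. It assumes S = min((H + O) * fraction, O) and G = O - S, mirroring the 40G + 10G example elsewhere in this thread; the numbers below are hypothetical and the actual algorithm is the one in Appendix C of the SPIP doc.

    // Rough sketch of the O -> (G, S) split for an overhead-heavy profile (an assumption, not the SPIP algorithm).
    def splitOverheadGb(heapGb: Double, overheadGb: Double, burstFraction: Double): (Double, Double) = {
      val sharedGb = math.min((heapGb + overheadGb) * burstFraction, overheadGb)  // S: bursty portion, excluded from the request
      val guaranteedGb = overheadGb - sharedGb                                    // G: still booked in the request for safety
      (guaranteedGb, sharedGb)
    }

    // Hypothetical accelerator-heavy profile: 20G heap, 60G memoryOverhead, fraction 0.2
    // gives G = 44G and S = 16G, so the pod requests 20 + 44 = 64G but can still burst to 80G.
    val (g, s) = splitOverheadGb(heapGb = 20, overheadGb = 60, burstFraction = 0.2)
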
>>>>>>>
>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks for sharing this interesting proposal.
>>>>>>>>
>>>>>>>> My initial understanding is that Kubernetes will use the Executor Memory Request (H + G) for scheduling decisions, which allows for better resource packing. I have a few questions regarding the shared portion S:
>>>>>>>>
>>>>>>>> 1. How is the risk of host-level OOM mitigated when the total potential usage sum of H+G+S across all pods on a node exceeds its allocatable capacity? Does the proposal implicitly rely on the cluster operator to manually ensure an unrequested memory buffer exists on the node to serve as the shared pool?
>>>>>>>>
>>>>>>>> 2. Have you considered scheduling optimizations to ensure a strategic mix of executors with large S and small S values on a single node? I am wondering if this would reduce the probability of concurrent bursting and host-level OOM.
>>>>>>>>
>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I think I'm still missing something in the big picture:
>>>>>>>>>
>>>>>>>>> - Is the memory overhead off-heap? The formula indicates a fixed heap size, and the memory overhead can't be dynamic if it's on-heap.
>>>>>>>>>
>>>>>>>>> - Do Spark applications have static profiles? When we submit stages, the cluster is already allocated, so how can we change anything?
>>>>>>>>>
>>>>>>>>> - How do we assign the shared memory overhead? Fairly among all applications on the same physical node?
>>>>>>>>>
>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> we didn't separate the design into another doc since the main idea is relatively simple...
>>>>>>>>>>
>>>>>>>>>> for the request/limit calculation, I described it in Q4 of the SPIP doc: https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>>>>>>
>>>>>>>>>> it is calculated per resource profile (you could say per stage); when the cluster manager composes the pod spec, it calculates the new memory overhead based on what the user asks for in that resource profile.
>>>>>>>>>>
>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Do we have a design sketch? How do we determine the memory request and limit? Is it per stage or per executor?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> yeah, the implementation basically relies on the request/limit concept in K8S, ...
>>>>>>>>>>>>
>>>>>>>>>>>> but if any other cluster manager comes along in the future, as long as it has a similar concept, it can leverage this easily, as the main logic is implemented in ResourceProfile.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> This feature is only available on k8s because it allows containers to have dynamic resources?
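
Also, for the "decouple request/limit when composing the pod spec" part mentioned above, my mental model is roughly the snippet below. Only the fabric8 kubernetes-client builder calls are real API (Spark's K8s backend uses these types); how the per-ResourceProfile numbers actually flow into them is my guess, not the implementation.

    import io.fabric8.kubernetes.api.model.{Quantity, ResourceRequirementsBuilder}

    // Illustrative only: a decoupled memory request/limit on the executor container,
    // using the 40G-request / 50G-limit numbers from the example in this thread.
    val executorMemoryResources = new ResourceRequirementsBuilder()
      .addToRequests("memory", new Quantity("40Gi"))  // what the K8S scheduler books on the host
      .addToLimits("memory", new Quantity("50Gi"))    // spark.executor.memory + memoryOverhead, the burst ceiling
      .build()

That matches my reading of Q4 in the SPIP doc, but please correct me if the actual wiring per resource profile differs.
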
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Folks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead allocation algorithm for Spark@K8S to improve the memory utilization of Spark clusters. Please see more details in the SPIP doc <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>. Feedback and discussion are welcome.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Chao for being the shepherd of this feature. Also want to thank the authors of the original paper <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from ByteDance, specifically Rui ([email protected]) and Yixin ([email protected]).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>> Yao Wang
