+1

On Wed, Dec 17, 2025, 2:48 AM Wenchen Fan <[email protected]> wrote:
> +1
>
> On Wed, Dec 17, 2025 at 6:41 AM karuppayya <[email protected]> wrote:
>
>> +1 from me.
>> I think it's well-scoped and takes advantage of Kubernetes' features exactly for what they are designed for (as per my understanding).
>>
>> On Tue, Dec 16, 2025 at 8:17 AM Chao Sun <[email protected]> wrote:
>>
>>> Thanks Yao and Nan for the proposal, and thanks everyone for the detailed and thoughtful discussion.
>>>
>>> Overall, this looks like a valuable addition for organizations running Spark on Kubernetes, especially given how bursty memoryOverhead usage tends to be in practice. I appreciate that the change is relatively small in scope and fully opt-in, which helps keep the risk low.
>>>
>>> From my perspective, the questions raised on the thread and in the SPIP have been addressed. If others feel the same, do we have consensus to move forward with a vote? cc Wenchen, Qiegang, and Karuppayya.
>>>
>>> Best,
>>> Chao
>>>
>>> On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu <[email protected]> wrote:
>>>
>>>> this is a good question
>>>>
>>>> > a stage is bursty and consumes the shared portion and fails to release it for subsequent stages
>>>>
>>>> in the scenario you described, since the memory-leaking stage and the subsequent ones are from the same job, the pod will likely be killed by the cgroup OOM killer
>>>>
>>>> taking the following as an example:
>>>>
>>>> the usage pattern is G = 5GB, S = 2GB, so it uses G + S at max and, in theory, it should release all 7G and then claim 7G again in some later stages; however, due to the leak, it holds 2G forever and asks for another 7G, and as a result it hits the pod memory limit and the cgroup OOM killer will take action to terminate the pod
>>>>
>>>> so this should be safe for the system
>>>>
>>>> however, we should be careful about this kind of memory peak for sure, because it essentially breaks the assumption that memoryOverhead usage is bursty (memory peak ~= memory used forever)... unfortunately, shared/guaranteed memory is managed by user applications instead of at the cluster level; they, especially S, are just logical concepts rather than a physical memory pool which pods can explicitly claim memory from...
>>>>
>>>> On Thu, Dec 11, 2025 at 10:17 PM karuppayya <[email protected]> wrote:
>>>>
>>>>> Thanks for the interesting proposal.
>>>>> The design seems to rely on memoryOverhead being transient.
>>>>> What happens when a stage is bursty and consumes the shared portion and fails to release it for subsequent stages (e.g., off-heap buffers that are not garbage collected since they are off-heap)? Would this trigger the host-level OOM described in Q6, or are there strategies to release the shared portion?
>>>>>
>>>>> On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]> wrote:
>>>>>
>>>>>> yes, that's the worst case in the scenario; please check my earlier response to Qiegang's question, we have a set of strategies adopted in prod to mitigate the issue
>>>>>>
>>>>>> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for the explanation! So the executor is not guaranteed to get 50 GB physical memory, right? All pods on the same host may reach peak memory usage at the same time and cause paging/swapping, which hurts performance?
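The arithmetic in Nan's leak scenario above can be written down as a quick back-of-the-envelope check. This is a minimal sketch, not code from the SPIP; the variable names are illustrative and the numbers are taken from the example in the thread.

```scala
object LeakScenarioCheck {
  def main(args: Array[String]): Unit = {
    val guaranteedGb = 5.0                       // G: counted in the pod's memory request
    val sharedGb     = 2.0                       // S: bursty portion, not counted in the request
    val podLimitGb   = guaranteedGb + sharedGb   // the limit stays at the full ask (7 GB)

    val leakedGb  = sharedGb                     // 2 GB never released by the earlier stage
    val nextAskGb = guaranteedGb + sharedGb      // a later stage asks for the full 7 GB again
    val usageGb   = leakedGb + nextAskGb         // 9 GB in total

    println(f"usage = $usageGb%.1f GB, pod limit = $podLimitGb%.1f GB")
    if (usageGb > podLimitGb) {
      // The overrun is caught by the pod's own cgroup limit, not by the host.
      println("cgroup OOM killer terminates this pod")
    }
  }
}
```

The point of the example is that the overrun trips the pod's own cgroup limit, so it stays contained to the offending executor rather than spilling over to the rest of the host.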
>>>>>>>
>>>>>>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <[email protected]> wrote:
>>>>>>>
>>>>>>>> np, let me try to explain
>>>>>>>>
>>>>>>>> 1. Each executor container runs in a pod together with some other sidecar containers taking care of tasks like authentication, etc. For simplicity, we assume each pod has only one container, which is the executor container
>>>>>>>>
>>>>>>>> 2. Each container is assigned two values, *request* and *limit* (limit >= request), for both CPU and memory resources (we only discuss memory here). Each pod will have request/limit values equal to the sum over all containers belonging to this pod
>>>>>>>>
>>>>>>>> 3. The K8s scheduler chooses a machine to host a pod based on the *request* value, and caps the resource usage of each container based on its *limit* value, e.g. if I have a pod with a single container in it, and it has 1G/2G as request and limit values respectively, any machine with 1G of free RAM will be a candidate to host this pod, and when the container uses more than 2G of memory, it will be killed by the cgroup OOM killer. Once a pod is scheduled to a host, the memory space sized at "the sum of all its containers' request values" will be booked exclusively for this pod.
>>>>>>>>
>>>>>>>> 4. By default, Spark *sets request and limit to the same value for executors on K8s*, and this value is basically spark.executor.memory + spark.executor.memoryOverhead in most cases. However, spark.executor.memoryOverhead usage is very bursty: a user setting spark.executor.memoryOverhead to 10G usually means each executor only needs the 10G for a very small portion of the executor's whole lifecycle
>>>>>>>>
>>>>>>>> 5. The proposed SPIP is essentially to decouple the request/limit values for executors in Spark@K8s in a safe way (this idea is from the ByteDance paper we refer to in the SPIP doc).
>>>>>>>>
>>>>>>>> Using the aforementioned example,
>>>>>>>>
>>>>>>>> if we have a single-node cluster with 100G of RAM, two pods each requesting 40G + 10G (on-heap + memoryOverhead), and the bursty factor set to 1.2: without the mechanism proposed in this SPIP, we can host at most 2 pods on this machine, and because of the bursty usage of that 10G space, memory utilization would be compromised.
>>>>>>>>
>>>>>>>> When applying the burst-aware memory allocation, we only need 40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. we have 20G of free memory left on the machine which can be used to host some smaller pods. At the same time, as we didn't change the limit value of the executor pods, these executors can still use up to 50G.
>>>>>>>>
>>>>>>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Sorry I'm not very familiar with the k8s infra, how does it work under the hood? Will the container adjust its system memory size depending on the actual memory usage of the processes in this container?
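Nan's worked example above can be captured as a small calculation. This is a minimal sketch, not the SPIP's implementation; the function and parameter names are made up for illustration, and the bursty factor of 1.2 is interpreted here as "up to 20% of the total ask may be left unrequested".

```scala
object BurstAwareRequest {
  /**
   * Returns (requestGiB, limitGiB) for an executor pod.
   * heapGiB ~ spark.executor.memory, overheadGiB ~ spark.executor.memoryOverhead.
   */
  def memoryRequestAndLimit(heapGiB: Double, overheadGiB: Double, burstyFactor: Double): (Double, Double) = {
    val totalGiB = heapGiB + overheadGiB
    // Shared portion S: the part of the ask left unrequested, capped by the
    // overhead itself so the on-heap portion is always fully requested.
    val sharedGiB  = math.min(totalGiB * (burstyFactor - 1.0), overheadGiB)
    val requestGiB = totalGiB - sharedGiB // what the K8s scheduler books on the node
    val limitGiB   = totalGiB             // the cgroup cap stays at the full ask
    (requestGiB, limitGiB)
  }

  def main(args: Array[String]): Unit = {
    // The example from the thread: 40G heap + 10G overhead, bursty factor 1.2.
    val (request, limit) = memoryRequestAndLimit(heapGiB = 40, overheadGiB = 10, burstyFactor = 1.2)
    println(f"request = $request%.0f GiB, limit = $limit%.0f GiB") // 40 GiB / 50 GiB
  }
}
```

With the numbers from the thread this yields a 40G request and a 50G limit per pod, which is why the 100G node can fit the two large executors and still have roughly 20G left over for smaller pods.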
>>>>>>>>>
>>>>>>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> yeah, we have a few cases where O is significantly larger than H, and the proposed algorithm is actually a great fit
>>>>>>>>>>
>>>>>>>>>> as I explained in Appendix C of the SPIP doc, the proposed algorithm will allocate a non-trivial G to ensure safe execution but still cut a big chunk of memory (tens of GBs) and treat it as S, saving tons of money that would otherwise be burnt by it
>>>>>>>>>>
>>>>>>>>>> but regarding native accelerators, some native acceleration engines do not use memoryOverhead but use off-heap memory (spark.memory.offHeap.size) explicitly (e.g. Gluten). The current implementation does not cover this part, but that would be an easy extension
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>>
>>>>>>>>>>> Have you tested in environments where O is bigger than H? Wondering if the proposed algorithm would help more in those environments (e.g. with native accelerators)?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Qiegang, thanks for the good questions as well
>>>>>>>>>>>>
>>>>>>>>>>>> please check the following answers
>>>>>>>>>>>>
>>>>>>>>>>>> > My initial understanding is that Kubernetes will use the Executor Memory Request (H + G) for scheduling decisions, which allows for better resource packing.
>>>>>>>>>>>>
>>>>>>>>>>>> yes, your understanding is correct
>>>>>>>>>>>>
>>>>>>>>>>>> > How is the risk of host-level OOM mitigated when the total potential usage sum of H+G+S across all pods on a node exceeds its allocatable capacity? Does the proposal implicitly rely on the cluster operator to manually ensure an unrequested memory buffer exists on the node to serve as the shared pool?
>>>>>>>>>>>>
>>>>>>>>>>>> in PINS, we basically apply a set of strategies: setting a conservative bursty factor, progressive rollout, and monitoring cluster metrics like Linux kernel OOM killer occurrences to guide us to the optimal bursty factor setup... usually, K8s operators will set reserved space for daemon processes on each host; we found it sufficient in our case, and our major tuning focuses on the bursty factor value
>>>>>>>>>>>>
>>>>>>>>>>>> > Have you considered scheduling optimizations to ensure a strategic mix of executors with large S and small S values on a single node? I am wondering if this would reduce the probability of concurrent bursting and host-level OOM.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, when we worked on this project we paid some attention to the cluster scheduling policy/behavior... two things we mostly care about:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. as stated in the SPIP doc, the cluster should have a certain level of diversity of workloads so that we have enough candidates to form a mixed set of executors with large S and small S values
>>>>>>>>>>>>
>>>>>>>>>>>> 2. we avoid using a bin-packing scheduling algorithm, which tends to pack more pods from the same job onto the same host and can create trouble, as they are more likely to ask for max memory at the same time
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for sharing this interesting proposal.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My initial understanding is that Kubernetes will use the Executor Memory Request (H + G) for scheduling decisions, which allows for better resource packing. I have a few questions regarding the shared portion S:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. How is the risk of host-level OOM mitigated when the total potential usage sum of H+G+S across all pods on a node exceeds its allocatable capacity? Does the proposal implicitly rely on the cluster operator to manually ensure an unrequested memory buffer exists on the node to serve as the shared pool?
>>>>>>>>>>>>> 2. Have you considered scheduling optimizations to ensure a strategic mix of executors with large S and small S values on a single node? I am wondering if this would reduce the probability of concurrent bursting and host-level OOM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think I'm still missing something in the big picture:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Is the memory overhead off-heap? The formula indicates a fixed heap size, and memory overhead can't be dynamic if it's on-heap.
>>>>>>>>>>>>>> - Do Spark applications have static profiles? When we submit stages, the cluster is already allocated, so how can we change anything?
>>>>>>>>>>>>>> - How do we assign the shared memory overhead? Fairly among all applications on the same physical node?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> we didn't separate the design into another doc since the main idea is relatively simple...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> for the request/limit calculation, I described it in Q4 of the SPIP doc https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> it is calculated per profile (you can say per stage): when the cluster manager composes the pod spec, it calculates the new memory overhead based on what the user asks for in that resource profile
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do we have a design sketch? How do we determine the memory request and limit? Is it per stage or per executor?
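Based on Nan's description that the overhead is recomputed per resource profile when the cluster manager composes the pod spec, the shape of that calculation might look like the sketch below. The case classes and helper are hypothetical and for illustration only; the real change lives in Spark's ResourceProfile / K8s backend code, which is not shown in the thread.

```scala
// Hypothetical stand-ins for the memory settings carried by a ResourceProfile;
// the actual SPIP change would read these from the profile's executor requests.
final case class ProfileMemory(profileId: Int, heapMiB: Long, overheadMiB: Long)
final case class PodMemorySpec(profileId: Int, requestMiB: Long, limitMiB: Long)

object PerProfilePodMemory {
  // The same request/limit split, applied independently to each resource profile
  // at the moment the K8s backend composes that profile's executor pod spec.
  def forProfile(p: ProfileMemory, burstyFactor: Double): PodMemorySpec = {
    val total  = p.heapMiB + p.overheadMiB
    val shared = math.min((total * (burstyFactor - 1.0)).toLong, p.overheadMiB)
    PodMemorySpec(p.profileId, requestMiB = total - shared, limitMiB = total)
  }

  def main(args: Array[String]): Unit = {
    // Two hypothetical stage-level profiles, e.g. the default one and a memory-heavy stage.
    val profiles = Seq(ProfileMemory(0, 40960, 10240), ProfileMemory(1, 8192, 4096))
    profiles.map(p => forProfile(p, burstyFactor = 1.2)).foreach(println)
  }
}
```

Under stage-level scheduling, each profile would get its own request/limit pair this way, while the limit always stays at the profile's full heap + overhead ask.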
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> yeah, the implementation is basically relying on the request/limit concept in K8S, ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> but if any other cluster manager comes along in the future, as long as it has a similar concept, it can leverage this easily, as the main logic is implemented in ResourceProfile
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This feature is only available on k8s because it allows containers to have dynamic resources?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Folks,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead allocation algorithm for Spark@K8S to improve the memory utilization of Spark clusters. Please see more details in the SPIP doc <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>. Feedback and discussion are welcome.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks Chao for being the shepherd of this feature. We also want to thank the authors of the original paper <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from ByteDance, specifically Rui ([email protected]) and Yixin ([email protected]).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>>>>>> Yao Wang
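Earlier in the thread, Nan notes that engines like Gluten rely on explicit off-heap memory (spark.memory.offHeap.size) rather than memoryOverhead, that the current implementation does not cover it, and that it would be an easy extension. The sketch below only illustrates how the same split could be extended to treat off-heap as bursty; it is not part of the proposed SPIP, and the function and parameter names are made up for illustration.

```scala
object OffHeapExtensionSketch {
  /**
   * Same request/limit split, but treating explicit off-heap memory
   * (spark.memory.offHeap.size) as bursty alongside memoryOverhead.
   * Purely illustrative; NOT part of the current SPIP implementation.
   */
  def requestAndLimit(heapMiB: Long, overheadMiB: Long, offHeapMiB: Long, burstyFactor: Double): (Long, Long) = {
    val total     = heapMiB + overheadMiB + offHeapMiB
    val burstyAsk = overheadMiB + offHeapMiB   // the portions whose usage tends to be bursty
    val shared    = math.min((total * (burstyFactor - 1.0)).toLong, burstyAsk)
    (total - shared, total)                    // (K8s request, K8s limit)
  }

  def main(args: Array[String]): Unit = {
    // A Gluten-style setup where off-heap dwarfs the JVM heap.
    val (request, limit) = requestAndLimit(heapMiB = 8192, overheadMiB = 2048, offHeapMiB = 20480, burstyFactor = 1.2)
    println(s"request = $request MiB, limit = $limit MiB") // 24576 MiB / 30720 MiB
  }
}
```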
