Thanks Yao and Nan for the proposal, and thanks everyone for the detailed
and thoughtful discussion.

Overall, this looks like a valuable addition for organizations running
Spark on Kubernetes, especially given how bursty memoryOverhead usage tends
to be in practice. I appreciate that the change is relatively small in
scope and fully opt-in, which helps keep the risk low.

From my perspective, the questions raised on the thread and in the SPIP
have been addressed. If others feel the same, do we have consensus to move
forward with a vote? cc Wenchen, Qiegang, and Karuppayya.

Best,
Chao

On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu <[email protected]> wrote:

> this is a good question
>
> > a stage is bursty and consumes the shared portion and fails to release
> it for subsequent stages
>
> in the scenario you described, since the memory-leaking stage and the
> subsequent ones are from the same job, the pod will likely be killed by
> the cgroup OOM killer
>
> taking the following as an example:
>
> the usage pattern is G = 5GB, S = 2GB; the executor uses G + S = 7GB at
> max, and in theory it should release all 7GB and then claim 7GB again in
> some later stages. However, due to the leak, it holds 2GB forever and asks
> for another 7GB; as a result, it hits the pod memory limit and the cgroup
> OOM killer will take action to terminate the pod
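>
> to spell out the arithmetic (a quick sketch in Scala; the 2GB held forever
> is the leaked S portion from the earlier peak):
>
>   // illustrative numbers from the scenario above, in GB
>   val g = 5          // guaranteed portion G
>   val s = 2          // shared portion S
>   val limit = g + s  // pod memory limit = 7GB
>   val heldForever = s        // the leaky stage never releases S
>   val nextStageClaim = g + s // a later stage claims G + S again
>   // 2 + 7 = 9GB > the 7GB limit => the cgroup OOM killer terminates the pod
>   assert(heldForever + nextStageClaim > limit)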
>
> so this should be safe for the system
>
>
>
> however, we should be careful about sustained memory peaks for sure,
> because they essentially break the assumption that memoryOverhead usage is
> bursty (a sustained peak ~= using the memory forever)... unfortunately,
> shared/guaranteed memory is managed by user applications instead of at the
> cluster level; they, especially S, are just logical concepts instead of a
> physical memory pool which pods can explicitly claim memory from...
>
>
> On Thu, Dec 11, 2025 at 10:17 PM karuppayya <[email protected]>
> wrote:
>
>> Thanks for the interesting proposal.
>> The design seems to rely on memoryOverhead usage being transient.
>> What happens when a stage is bursty, consumes the shared portion, and
>> fails to release it for subsequent stages (e.g., off-heap buffers that
>> are not garbage collected since they are off-heap)? Would this trigger a
>> host-level OOM as described in Q6, or are there strategies to release the
>> shared portion?
>>
>>
>> On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]> wrote:
>>
>>> yes, that's the worst case in that scenario. Please check my earlier
>>> response to Qiegang's question; we have a set of strategies adopted in
>>> prod to mitigate the issue
>>>
>>> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <[email protected]> wrote:
>>>
>>>> Thanks for the explanation! So the executor is not guaranteed to get 50
>>>> GB physical memory, right? All pods on the same host may reach peak memory
>>>> usage at the same time and cause paging/swapping which hurts performance?
>>>>
>>>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <[email protected]>
>>>> wrote:
>>>>
>>>>> np, let me try to explain
>>>>>
>>>>> 1. Each executor container runs in a pod together with some other
>>>>> sidecar containers taking care of tasks like authentication, etc. For
>>>>> simplicity, we assume each pod has only one container, which is the
>>>>> executor container
>>>>>
>>>>> 2. Each container is assigned two values, *request* and *limit*
>>>>> (limit >= request), for both CPU and memory resources (we only discuss
>>>>> memory here). Each pod's request/limit values are the sum of those of
>>>>> all containers belonging to it
>>>>>
>>>>> 3. The K8S scheduler chooses a machine to host a pod based on the
>>>>> *request* value, and caps the resource usage of each container based on
>>>>> its *limit* value. E.g., if I have a pod with a single container in it,
>>>>> and it has 1G/2G as its request and limit values respectively, any
>>>>> machine with 1G of free RAM will be a candidate to host this pod, and
>>>>> when the container uses more than 2G of memory, it will be killed by the
>>>>> cgroup OOM killer. Once a pod is scheduled to a host, memory space sized
>>>>> at the "sum of all its containers' request values" will be booked
>>>>> exclusively for this pod.
>>>>>
>>>>> 4. By default, Spark *sets request and limit to the same value for
>>>>> executors on k8s*, and this value is basically
>>>>> spark.executor.memory + spark.executor.memoryOverhead in most cases.
>>>>> However, spark.executor.memoryOverhead usage is very bursty: a user
>>>>> setting spark.executor.memoryOverhead to 10G usually means each executor
>>>>> needs the 10G only in a very small portion of its whole lifecycle
>>>>>
>>>>> 5. The proposed SPIP essentially decouples the request/limit values in
>>>>> spark@k8s for executors in a safe way (this idea is from the ByteDance
>>>>> paper we refer to in the SPIP doc).
>>>>>
>>>>> Using the aforementioned example:
>>>>>
>>>>> if we have a single-node cluster with 100G of RAM and two pods each
>>>>> requesting 40G + 10G (on-heap + memoryOverhead), with the bursty factor
>>>>> set to 1.2: without the mechanism proposed in this SPIP, we can host at
>>>>> most 2 pods on this machine, and because of the bursty usage of that 10G
>>>>> space, memory utilization would be compromised.
>>>>>
>>>>> When applying the burst-aware memory allocation, we only need
>>>>> 40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. we have
>>>>> 20G of free memory left on the machine which can be used to host some
>>>>> smaller pods. At the same time, as we didn't change the limit value of
>>>>> the executor pods, these executors can still use 50G at max.
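>>>>>
>>>>> to make that concrete, a minimal Scala sketch of the arithmetic above
>>>>> (podMemoryRequestGb and burstyFactor are illustrative names, not the
>>>>> actual config or implementation):
>>>>>
>>>>>   // H = on-heap (spark.executor.memory), O = memoryOverhead, in GB
>>>>>   def podMemoryRequestGb(h: Double, o: Double, burstyFactor: Double): Double = {
>>>>>     val s = math.min((h + o) * (burstyFactor - 1.0), o) // shared portion S
>>>>>     h + o - s // request = H + G where G = O - S; the limit stays H + O
>>>>>   }
>>>>>   // the example above: 40 + 10 - min((40 + 10) * 0.2, 10) = 40G
>>>>>   assert(math.abs(podMemoryRequestGb(40, 10, 1.2) - 40.0) < 1e-9)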
>>>>>
>>>>>
>>>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Sorry, I'm not very familiar with the k8s infra. How does it work
>>>>>> under the hood? Does the container adjust its system memory size
>>>>>> depending on the actual memory usage of the processes in it?
>>>>>>
>>>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> yeah, we have a few cases where O is significantly larger than H;
>>>>>>> the proposed algorithm is actually a great fit
>>>>>>>
>>>>>>> as I explained in Appendix C of the SPIP doc, the proposed algorithm
>>>>>>> will allocate a non-trivial G to ensure the job runs safely, but still
>>>>>>> cut a big chunk of memory (tens of GBs) and treat it as S, saving tons
>>>>>>> of money otherwise burnt by it
>>>>>>>
>>>>>>> but regarding native accelerators, some native acceleration engines
>>>>>>> do not use memoryOverhead but instead use off-heap memory
>>>>>>> (spark.memory.offHeap.size) explicitly (e.g. Gluten). The current
>>>>>>> implementation does not cover this part, though that would be an easy
>>>>>>> extension
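>>>>>>>
>>>>>>> for reference, such engines size their native memory via the standard
>>>>>>> off-heap configs rather than memoryOverhead, e.g. (a hedged sketch,
>>>>>>> illustrative values):
>>>>>>>
>>>>>>>   import org.apache.spark.SparkConf
>>>>>>>   // off-heap memory is declared explicitly, separately from memoryOverhead
>>>>>>>   val conf = new SparkConf()
>>>>>>>     .set("spark.memory.offHeap.enabled", "true")
>>>>>>>     .set("spark.memory.offHeap.size", "20g")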
>>>>>>>
>>>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the reply.
>>>>>>>>
>>>>>>>> Have you tested in environments where O is bigger than H? Wondering
>>>>>>>> if the proposed algorithm would help more in those environments
>>>>>>>> (e.g., with native accelerators)?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Qiegang, thanks for the good questions as well
>>>>>>>>>
>>>>>>>>> please see the following answers
>>>>>>>>>
>>>>>>>>> > My initial understanding is that Kubernetes will use the Executor
>>>>>>>>> Memory Request (H + G) for scheduling decisions, which allows for
>>>>>>>>> better resource packing.
>>>>>>>>>
>>>>>>>>> yes, your understanding is correct
>>>>>>>>>
>>>>>>>>> > How is the risk of host-level OOM mitigated when the total
>>>>>>>>> potential usage (sum of H+G+S) across all pods on a node exceeds its
>>>>>>>>> allocatable capacity? Does the proposal implicitly rely on the
>>>>>>>>> cluster operator to manually ensure an unrequested memory buffer
>>>>>>>>> exists on the node to serve as the shared pool?
>>>>>>>>>
>>>>>>>>> in PINS, we basically apply a set of strategies: setting a
>>>>>>>>> conservative bursty factor, rolling out progressively, and monitoring
>>>>>>>>> cluster metrics like Linux kernel OOM killer occurrences to guide us
>>>>>>>>> to the optimal bursty factor setup... usually, K8S operators will set
>>>>>>>>> reserved space for daemon processes on each host; we found it
>>>>>>>>> sufficient in our case, and our major tuning focuses on the bursty
>>>>>>>>> factor value
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > Have you considered scheduling optimizations to ensure a
>>>>>>>>> strategic mix of executors with large S and small S values on a
>>>>>>>>> single node? I am wondering if this would reduce the probability of
>>>>>>>>> concurrent bursting and host-level OOM.
>>>>>>>>>
>>>>>>>>> Yes, while working on this project, we paid some attention to the
>>>>>>>>> cluster scheduling policy/behavior... two things we mostly care about:
>>>>>>>>>
>>>>>>>>> 1. as stated in the SPIP doc, the cluster should have a certain
>>>>>>>>> level of workload diversity so that we have enough candidates to form
>>>>>>>>> a mixed set of executors with large S and small S values
>>>>>>>>>
>>>>>>>>> 2. we avoid using a binpack scheduling algorithm, which tends to
>>>>>>>>> pack more pods from the same job onto the same host; that can create
>>>>>>>>> trouble as those pods are more likely to ask for max memory at the
>>>>>>>>> same time
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for sharing this interesting proposal.
>>>>>>>>>>
>>>>>>>>>> My initial understanding is that Kubernetes will use the Executor
>>>>>>>>>> Memory Request (H + G) for scheduling decisions, which allows
>>>>>>>>>> for better resource packing.  I have a few questions regarding
>>>>>>>>>> the shared portion S:
>>>>>>>>>>
>>>>>>>>>>    1. How is the risk of host-level OOM mitigated when the total
>>>>>>>>>>    potential usage (sum of H+G+S) across all pods on a node exceeds
>>>>>>>>>>    its allocatable capacity? Does the proposal implicitly rely on
>>>>>>>>>>    the cluster operator to manually ensure an unrequested memory
>>>>>>>>>>    buffer exists on the node to serve as the shared pool?
>>>>>>>>>>    2. Have you considered scheduling optimizations to ensure a
>>>>>>>>>>    strategic mix of executors with large S and small S values on
>>>>>>>>>>    a single node? I am wondering if this would reduce the
>>>>>>>>>>    probability of concurrent bursting and host-level OOM.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think I'm still missing something in the big picture:
>>>>>>>>>>>
>>>>>>>>>>>    - Is the memory overhead off-heap? The formula indicates a
>>>>>>>>>>>    fixed heap size, and memory overhead can't be dynamic if it's
>>>>>>>>>>>    on-heap.
>>>>>>>>>>>    - Do Spark applications have static profiles? When we submit
>>>>>>>>>>>    stages, the cluster is already allocated; how can we change
>>>>>>>>>>>    anything?
>>>>>>>>>>>    - How do we assign the shared memory overhead? Fairly among
>>>>>>>>>>>    all applications on the same physical node?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> we didn't separate the design into another doc since the main
>>>>>>>>>>>> idea is relatively simple...
>>>>>>>>>>>>
>>>>>>>>>>>> for request/limit calculation, I described it in Q4 of the SPIP
>>>>>>>>>>>> doc
>>>>>>>>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>>>>>>>>
>>>>>>>>>>>> it is calculated per resource profile (you could say per stage);
>>>>>>>>>>>> when the cluster manager composes the pod spec, it calculates the
>>>>>>>>>>>> new memory overhead based on what the user asks for in that
>>>>>>>>>>>> resource profile
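>>>>>>>>>>>>
>>>>>>>>>>>> roughly, in Scala (a hedged sketch with hypothetical names; the
>>>>>>>>>>>> real calculation lives where the cluster manager composes the pod
>>>>>>>>>>>> spec):
>>>>>>>>>>>>
>>>>>>>>>>>>   // new (guaranteed) overhead G per resource profile; S is dropped
>>>>>>>>>>>>   // from the pod's memory request but kept in its limit
>>>>>>>>>>>>   def newOverheadGb(heapGb: Double, overheadGb: Double,
>>>>>>>>>>>>       burstyFactor: Double): Double = {
>>>>>>>>>>>>     val s = math.min((heapGb + overheadGb) * (burstyFactor - 1.0), overheadGb)
>>>>>>>>>>>>     overheadGb - s
>>>>>>>>>>>>   }
>>>>>>>>>>>>   // e.g. a profile asking 40G heap + 10G overhead with bursty
>>>>>>>>>>>>   // factor 1.2 yields G ~= 0, i.e. a 40G request and a 50G limit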
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Do we have a design sketch? How to determine the memory
>>>>>>>>>>>>> request and limit? Is it per stage or per executor?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> yeah, the implementation basically relies on the
>>>>>>>>>>>>>> request/limit concept in K8S, ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> but if any other cluster manager comes along in the future, as
>>>>>>>>>>>>>> long as it has a similar concept, it can leverage this easily,
>>>>>>>>>>>>>> as the main logic is implemented in ResourceProfile
>>>>>>>>>>>>>> main logic is implemented in ResourceProfile
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This feature is only available on k8s because it allows
>>>>>>>>>>>>>>> containers to have dynamic resources?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Folks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead allocation
>>>>>>>>>>>>>>>> algorithm for Spark@K8S to improve the memory utilization of
>>>>>>>>>>>>>>>> Spark clusters.
>>>>>>>>>>>>>>>> Please see more details in the SPIP doc
>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>>>>>>>>>>>> Feedback and discussion are welcome.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks Chao for being the shepherd of this feature.
>>>>>>>>>>>>>>>> Also want to thank the authors of the original paper
>>>>>>>>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from
>>>>>>>>>>>>>>>> ByteDance, specifically Rui ([email protected])
>>>>>>>>>>>>>>>> and Yixin ([email protected]).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>>> Yao Wang
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
