Thanks for the interesting proposal. The design seems to rely on memoryOverhead usage being transient. What happens when a stage is bursty, consumes the shared portion, and fails to release it for subsequent stages (e.g., off-heap buffers that are not garbage collected since they live off-heap)? Would this trigger the host-level OOM described in Q6, or are there strategies to release the shared portion?
On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]> wrote:

> yes, that's the worst case in that scenario. Please check my earlier response to Qiegang's question; we have a set of strategies adopted in prod to mitigate the issue.
>
> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <[email protected]> wrote:
>
>> Thanks for the explanation! So the executor is not guaranteed to get 50 GB of physical memory, right? All pods on the same host may reach peak memory usage at the same time and cause paging/swapping, which hurts performance?
>>
>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <[email protected]> wrote:
>>
>>> np, let me try to explain
>>>
>>> 1. Each executor container runs in a pod together with some other sidecar containers taking care of tasks like authentication, etc. For simplicity, we assume each pod has only one container, which is the executor container.
>>>
>>> 2. Each container is assigned two values, *request* and *limit* (limit >= request), for both CPU and memory resources (we only discuss memory here). A pod's request/limit values are the sums over all containers belonging to that pod.
>>>
>>> 3. The K8S scheduler chooses a machine to host a pod based on the *request* value, and caps the resource usage of each container based on its *limit* value. E.g., if I have a pod with a single container that has 1G/2G as its request and limit respectively, any machine with 1G of free RAM is a candidate to host this pod, and when the container uses more than 2G of memory, it will be killed by the cgroup OOM killer. Once a pod is scheduled to a host, memory sized at the "sum of all its containers' request values" is booked exclusively for this pod.
>>>
>>> 4. By default, Spark *sets request and limit to the same value for executors on k8s*, and this value is basically spark.executor.memory + spark.executor.memoryOverhead in most cases. However, spark.executor.memoryOverhead usage is very bursty: a user setting spark.executor.memoryOverhead to 10G usually means each executor only needs the full 10G for a very small portion of its lifecycle.
>>>
>>> 5. The proposed SPIP essentially decouples the request/limit values for executors in spark@k8s in a safe way (this idea is from the ByteDance paper we refer to in the SPIP doc).
>>>
>>> Using the aforementioned example:
>>>
>>> if we have a single-node cluster with 100G of RAM, two pods each requesting 40G + 10G (on-heap + memoryOverhead), and a bursty factor of 1.2, then without the mechanism proposed in this SPIP we can host at most 2 such pods on this machine, and because of the bursty usage of that 10G space, memory utilization is compromised.
>>>
>>> When applying the burst-aware memory allocation, we only need 40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. 20G of free memory is left on the machine, which can be used to host some smaller pods. At the same time, since we didn't change the limit value of the executor pods, these executors can still use 50G at max.
>>>
>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <[email protected]> wrote:
>>>
>>>> Sorry, I'm not very familiar with the k8s infra; how does it work under the hood? Does the container adjust its memory size depending on the actual memory usage of the processes in it?
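
To make sure I'm reading the arithmetic above correctly, here is a quick Scala sketch of the request/limit calculation as I understand it from the example (the names, and the way I turned the bursty factor 1.2 into a 0.2 fraction, are my own reading; the authoritative formula is in the SPIP doc):

    // Sketch only, not the SPIP implementation.
    object BurstAwareRequestSketch {
      // Returns (request, limit) in GB for one executor pod.
      def requestAndLimitGb(heapGb: Double, overheadGb: Double, burstFraction: Double): (Double, Double) = {
        val limitGb = heapGb + overheadGb                             // unchanged: the executor can still burst to H + O
        val sharedGb = math.min(limitGb * burstFraction, overheadGb)  // the portion treated as bursty/shared
        (limitGb - sharedGb, limitGb)                                 // the request is what the K8S scheduler books exclusively
      }

      def main(args: Array[String]): Unit = {
        // The example from the thread: 40G on-heap + 10G memoryOverhead, bursty factor 1.2
        val (request, limit) = requestAndLimitGb(heapGb = 40, overheadGb = 10, burstFraction = 0.2)
        println(s"request = ${request}G, limit = ${limit}G")          // prints: request = 40.0G, limit = 50.0G
      }
    }

With two such pods requested (2 x 40G), a 100G machine still has 20G unbooked for smaller pods, while each executor keeps its 50G ceiling, which matches the numbers above.
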
>>>>
>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]> wrote:
>>>>
>>>>> yeah, we have a few cases where O is significantly larger than H; the proposed algorithm is actually a great fit there.
>>>>>
>>>>> as I explained in Appendix C of the SPIP doc, the proposed algorithm will allocate a non-trivial G to ensure safe execution but still cut a big chunk of memory (10s of GBs) and treat it as S, saving tons of money burnt on it.
>>>>>
>>>>> but regarding native accelerators, some native acceleration engines do not use memoryOverhead but use off-heap memory (spark.memory.offHeap.size) explicitly (e.g. Gluten). The current implementation does not cover this part, but that would be an easy extension.
>>>>>
>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]> wrote:
>>>>>
>>>>>> Thanks for the reply.
>>>>>>
>>>>>> Have you tested in environments where O is bigger than H? Wondering if the proposed algorithm would help more in those environments (e.g. with native accelerators)?
>>>>>>
>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Qiegang, thanks for the good questions as well.
>>>>>>>
>>>>>>> please check the following answers
>>>>>>>
>>>>>>> > My initial understanding is that Kubernetes will use the Executor Memory Request (H + G) for scheduling decisions, which allows for better resource packing.
>>>>>>>
>>>>>>> yes, your understanding is correct
>>>>>>>
>>>>>>> > How is the risk of host-level OOM mitigated when the total potential usage sum of H+G+S across all pods on a node exceeds its allocatable capacity? Does the proposal implicitly rely on the cluster operator to manually ensure an unrequested memory buffer exists on the node to serve as the shared pool?
>>>>>>>
>>>>>>> in PINS, we basically apply a set of strategies: setting a conservative bursty factor, progressive rollout, and monitoring cluster metrics like Linux kernel OOM killer occurrences to guide us to the optimal bursty factor setup... Usually, K8S operators set aside reserved space for daemon processes on each host; we found that to be sufficient in our case, and our major tuning focuses on the bursty factor value.
>>>>>>>
>>>>>>> > Have you considered scheduling optimizations to ensure a strategic mix of executors with large S and small S values on a single node? I am wondering if this would reduce the probability of concurrent bursting and host-level OOM.
>>>>>>>
>>>>>>> Yes, when we worked on this project, we paid some attention to the cluster scheduling policy/behavior... two things we mostly care about:
>>>>>>>
>>>>>>> 1. as stated in the SPIP doc, the cluster should have a certain level of workload diversity so that we have enough candidates to form a mixed set of executors with large S and small S values
>>>>>>>
>>>>>>> 2. we avoid using a binpack scheduling algorithm, which tends to pack more pods from the same job onto the same host; that can create trouble, as those pods are more likely to ask for max memory at the same time
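
On the O > H case discussed above: to check my understanding of how O gets split into G and S, here is a rough Scala sketch. It assumes S = min((H + O) * fraction, O) and G = O - S, mirroring the 40G + 10G example elsewhere in this thread; the numbers below are hypothetical and the actual algorithm is the one in Appendix C of the SPIP doc.

    // Rough sketch of the O -> (G, S) split for an overhead-heavy profile (an assumption, not the SPIP algorithm).
    def splitOverheadGb(heapGb: Double, overheadGb: Double, burstFraction: Double): (Double, Double) = {
      val sharedGb = math.min((heapGb + overheadGb) * burstFraction, overheadGb)  // S: bursty portion, excluded from the request
      val guaranteedGb = overheadGb - sharedGb                                    // G: still booked in the request for safety
      (guaranteedGb, sharedGb)
    }

    // Hypothetical accelerator-heavy profile: 20G heap, 60G memoryOverhead, fraction 0.2
    // gives G = 44G and S = 16G, so the pod requests 20 + 44 = 64G but can still burst to 80G.
    val (g, s) = splitOverheadGb(heapGb = 20, overheadGb = 60, burstFraction = 0.2)
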
>>>>>>>
>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks for sharing this interesting proposal.
>>>>>>>>
>>>>>>>> My initial understanding is that Kubernetes will use the Executor Memory Request (H + G) for scheduling decisions, which allows for better resource packing. I have a few questions regarding the shared portion S:
>>>>>>>>
>>>>>>>> 1. How is the risk of host-level OOM mitigated when the total potential usage sum of H+G+S across all pods on a node exceeds its allocatable capacity? Does the proposal implicitly rely on the cluster operator to manually ensure an unrequested memory buffer exists on the node to serve as the shared pool?
>>>>>>>>
>>>>>>>> 2. Have you considered scheduling optimizations to ensure a strategic mix of executors with large S and small S values on a single node? I am wondering if this would reduce the probability of concurrent bursting and host-level OOM.
>>>>>>>>
>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I think I'm still missing something in the big picture:
>>>>>>>>>
>>>>>>>>> - Is the memory overhead off-heap? The formula indicates a fixed heap size, and the memory overhead can't be dynamic if it's on-heap.
>>>>>>>>>
>>>>>>>>> - Do Spark applications have static profiles? When we submit stages, the cluster is already allocated, so how can we change anything?
>>>>>>>>>
>>>>>>>>> - How do we assign the shared memory overhead? Fairly among all applications on the same physical node?
>>>>>>>>>
>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> we didn't separate the design into another doc since the main idea is relatively simple...
>>>>>>>>>>
>>>>>>>>>> for the request/limit calculation, I described it in Q4 of the SPIP doc: https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>>>>>>
>>>>>>>>>> it is calculated per resource profile (you could say per stage); when the cluster manager composes the pod spec, it calculates the new memory overhead based on what the user asks for in that resource profile.
>>>>>>>>>>
>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Do we have a design sketch? How do we determine the memory request and limit? Is it per stage or per executor?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> yeah, the implementation basically relies on the request/limit concept in K8S, ...
>>>>>>>>>>>>
>>>>>>>>>>>> but if any other cluster manager comes along in the future, as long as it has a similar concept, it can leverage this easily, as the main logic is implemented in ResourceProfile.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> This feature is only available on k8s because it allows containers to have dynamic resources?
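
Also, for the "decouple request/limit when composing the pod spec" part mentioned above, my mental model is roughly the snippet below. Only the fabric8 kubernetes-client builder calls are real API (Spark's K8s backend uses these types); how the per-ResourceProfile numbers actually flow into them is my guess, not the implementation.

    import io.fabric8.kubernetes.api.model.{Quantity, ResourceRequirementsBuilder}

    // Illustrative only: a decoupled memory request/limit on the executor container,
    // using the 40G-request / 50G-limit numbers from the example in this thread.
    val executorMemoryResources = new ResourceRequirementsBuilder()
      .addToRequests("memory", new Quantity("40Gi"))  // what the K8S scheduler books on the host
      .addToLimits("memory", new Quantity("50Gi"))    // spark.executor.memory + memoryOverhead, the burst ceiling
      .build()

That matches my reading of Q4 in the SPIP doc, but please correct me if the actual wiring per resource profile differs.
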
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Folks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead allocation algorithm for Spark@K8S to improve the memory utilization of Spark clusters. Please see more details in the SPIP doc <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>. Feedback and discussion are welcome.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Chao for being the shepherd of this feature. Also want to thank the authors of the original paper <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from ByteDance, specifically Rui ([email protected]) and Yixin ([email protected]).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>> Yao Wang
