+1

On Wed, Dec 17, 2025, 2:48 AM Wenchen Fan <[email protected]> wrote:
> +1
>
> On Wed, Dec 17, 2025 at 6:41 AM karuppayya <[email protected]> wrote:
>
>> +1 from me.
>> I think it's well-scoped and takes advantage of Kubernetes' features exactly for what they are designed for (as per my understanding).
>>
>> On Tue, Dec 16, 2025 at 8:17 AM Chao Sun <[email protected]> wrote:
>>
>>> Thanks Yao and Nan for the proposal, and thanks everyone for the detailed and thoughtful discussion.
>>>
>>> Overall, this looks like a valuable addition for organizations running Spark on Kubernetes, especially given how bursty memoryOverhead usage tends to be in practice. I appreciate that the change is relatively small in scope and fully opt-in, which helps keep the risk low.
>>>
>>> From my perspective, the questions raised on the thread and in the SPIP have been addressed. If others feel the same, do we have consensus to move forward with a vote? cc Wenchen, Qiegang, and Karuppayya.
>>>
>>> Best,
>>> Chao
>>>
>>> On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu <[email protected]> wrote:
>>>
>>>> this is a good question
>>>>
>>>> > a stage is bursty and consumes the shared portion and fails to release it for subsequent stages
>>>>
>>>> in the scenario you described, since the memory-leaking stage and the subsequent ones are from the same job, the pod will likely be killed by the cgroup OOM killer
>>>>
>>>> taking the following as an example:
>>>>
>>>> the usage pattern is G = 5GB, S = 2GB, so it uses G + S at max and, in theory, it should release all 7G and then claim 7G again in some later stages; however, due to the leak, it holds 2G forever and asks for another 7G, and as a result it hits the pod memory limit and the cgroup OOM killer will take action to terminate the pod
>>>>
>>>> so this should be safe for the system
>>>>
>>>> however, we should be careful about this kind of memory peak for sure, because it essentially breaks the assumption that memoryOverhead usage is bursty (memory peak ~= memory used forever)... unfortunately, shared/guaranteed memory is managed by user applications instead of at the cluster level; they, especially S, are just logical concepts rather than a physical memory pool which pods can explicitly claim memory from...
>>>>
>>>> On Thu, Dec 11, 2025 at 10:17 PM karuppayya <[email protected]> wrote:
>>>>
>>>>> Thanks for the interesting proposal.
>>>>> The design seems to rely on memoryOverhead being transient.
>>>>> What happens when a stage is bursty and consumes the shared portion and fails to release it for subsequent stages (e.g., off-heap buffers that are not garbage collected since they are off-heap)? Would this trigger the host-level OOM described in Q6, or are there strategies to release the shared portion?
>>>>>
>>>>> On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]> wrote:
>>>>>
>>>>>> yes, that's the worst case in the scenario; please check my earlier response to Qiegang's question, we have a set of strategies adopted in prod to mitigate the issue
>>>>>>
>>>>>> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for the explanation! So the executor is not guaranteed to get 50 GB physical memory, right? All pods on the same host may reach peak memory usage at the same time and cause paging/swapping, which hurts performance?
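The arithmetic in Nan's leak scenario above can be written down as a quick back-of-the-envelope check. This is a minimal sketch, not code from the SPIP; the variable names are illustrative and the numbers are taken from the example in the thread.

```scala
object LeakScenarioCheck {
  def main(args: Array[String]): Unit = {
    val guaranteedGb = 5.0                       // G: counted in the pod's memory request
    val sharedGb     = 2.0                       // S: bursty portion, not counted in the request
    val podLimitGb   = guaranteedGb + sharedGb   // the limit stays at the full ask (7 GB)

    val leakedGb  = sharedGb                     // 2 GB never released by the earlier stage
    val nextAskGb = guaranteedGb + sharedGb      // a later stage asks for the full 7 GB again
    val usageGb   = leakedGb + nextAskGb         // 9 GB in total

    println(f"usage = $usageGb%.1f GB, pod limit = $podLimitGb%.1f GB")
    if (usageGb > podLimitGb) {
      // The overrun is caught by the pod's own cgroup limit, not by the host.
      println("cgroup OOM killer terminates this pod")
    }
  }
}
```

The point of the example is that the overrun trips the pod's own cgroup limit, so it stays contained to the offending executor rather than spilling over to the rest of the host.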
>>>>>>>
>>>>>>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <[email protected]> wrote:
>>>>>>>
>>>>>>>> np, let me try to explain
>>>>>>>>
>>>>>>>> 1. Each executor container runs in a pod together with some other sidecar containers taking care of tasks like authentication, etc. For simplicity, we assume each pod has only one container, which is the executor container
>>>>>>>>
>>>>>>>> 2. Each container is assigned two values, *request* and *limit* (limit >= request), for both CPU and memory resources (we only discuss memory here). Each pod will have request/limit values equal to the sum over all containers belonging to this pod
>>>>>>>>
>>>>>>>> 3. The K8s scheduler chooses a machine to host a pod based on the *request* value, and caps the resource usage of each container based on its *limit* value, e.g. if I have a pod with a single container in it, and it has 1G/2G as request and limit values respectively, any machine with 1G of free RAM will be a candidate to host this pod, and when the container uses more than 2G of memory, it will be killed by the cgroup OOM killer. Once a pod is scheduled to a host, the memory space sized at "the sum of all its containers' request values" will be booked exclusively for this pod.
>>>>>>>>
>>>>>>>> 4. By default, Spark *sets request and limit to the same value for executors on K8s*, and this value is basically spark.executor.memory + spark.executor.memoryOverhead in most cases. However, spark.executor.memoryOverhead usage is very bursty: a user setting spark.executor.memoryOverhead to 10G usually means each executor only needs the 10G for a very small portion of the executor's whole lifecycle
>>>>>>>>
>>>>>>>> 5. The proposed SPIP is essentially to decouple the request/limit values for executors in Spark@K8s in a safe way (this idea is from the ByteDance paper we refer to in the SPIP doc).
>>>>>>>>
>>>>>>>> Using the aforementioned example,
>>>>>>>>
>>>>>>>> if we have a single-node cluster with 100G of RAM, two pods each requesting 40G + 10G (on-heap + memoryOverhead), and the bursty factor set to 1.2: without the mechanism proposed in this SPIP, we can host at most 2 pods on this machine, and because of the bursty usage of that 10G space, memory utilization would be compromised.
>>>>>>>>
>>>>>>>> When applying the burst-aware memory allocation, we only need 40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. we have 20G of free memory left on the machine which can be used to host some smaller pods. At the same time, as we didn't change the limit value of the executor pods, these executors can still use up to 50G.
>>>>>>>>
>>>>>>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Sorry I'm not very familiar with the k8s infra, how does it work under the hood? Will the container adjust its system memory size depending on the actual memory usage of the processes in this container?
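Nan's worked example above can be captured as a small calculation. This is a minimal sketch, not the SPIP's implementation; the function and parameter names are made up for illustration, and the bursty factor of 1.2 is interpreted here as "up to 20% of the total ask may be left unrequested".

```scala
object BurstAwareRequest {
  /**
   * Returns (requestGiB, limitGiB) for an executor pod.
   * heapGiB ~ spark.executor.memory, overheadGiB ~ spark.executor.memoryOverhead.
   */
  def memoryRequestAndLimit(heapGiB: Double, overheadGiB: Double, burstyFactor: Double): (Double, Double) = {
    val totalGiB = heapGiB + overheadGiB
    // Shared portion S: the part of the ask left unrequested, capped by the
    // overhead itself so the on-heap portion is always fully requested.
    val sharedGiB  = math.min(totalGiB * (burstyFactor - 1.0), overheadGiB)
    val requestGiB = totalGiB - sharedGiB // what the K8s scheduler books on the node
    val limitGiB   = totalGiB             // the cgroup cap stays at the full ask
    (requestGiB, limitGiB)
  }

  def main(args: Array[String]): Unit = {
    // The example from the thread: 40G heap + 10G overhead, bursty factor 1.2.
    val (request, limit) = memoryRequestAndLimit(heapGiB = 40, overheadGiB = 10, burstyFactor = 1.2)
    println(f"request = $request%.0f GiB, limit = $limit%.0f GiB") // 40 GiB / 50 GiB
  }
}
```

With the numbers from the thread this yields a 40G request and a 50G limit per pod, which is why the 100G node can fit the two large executors and still have roughly 20G left over for smaller pods.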
>>>>>>>>>
>>>>>>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> yeah, we have a few cases where O is significantly larger than H, and the proposed algorithm is actually a great fit
>>>>>>>>>>
>>>>>>>>>> as I explained in Appendix C of the SPIP doc, the proposed algorithm will allocate a non-trivial G to ensure safe execution but still cut a big chunk of memory (tens of GBs) and treat it as S, saving tons of money that would otherwise be burnt by it
>>>>>>>>>>
>>>>>>>>>> but regarding native accelerators, some native acceleration engines do not use memoryOverhead but use off-heap memory (spark.memory.offHeap.size) explicitly (e.g. Gluten). The current implementation does not cover this part, but that would be an easy extension
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>>
>>>>>>>>>>> Have you tested in environments where O is bigger than H? Wondering if the proposed algorithm would help more in those environments (e.g. with native accelerators)?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Qiegang, thanks for the good questions as well
>>>>>>>>>>>>
>>>>>>>>>>>> please check the following answers
>>>>>>>>>>>>
>>>>>>>>>>>> > My initial understanding is that Kubernetes will use the Executor Memory Request (H + G) for scheduling decisions, which allows for better resource packing.
>>>>>>>>>>>>
>>>>>>>>>>>> yes, your understanding is correct
>>>>>>>>>>>>
>>>>>>>>>>>> > How is the risk of host-level OOM mitigated when the total potential usage sum of H+G+S across all pods on a node exceeds its allocatable capacity? Does the proposal implicitly rely on the cluster operator to manually ensure an unrequested memory buffer exists on the node to serve as the shared pool?
>>>>>>>>>>>>
>>>>>>>>>>>> in PINS, we basically apply a set of strategies: setting a conservative bursty factor, progressive rollout, and monitoring cluster metrics like Linux kernel OOM killer occurrences to guide us to the optimal bursty factor setup... usually, K8s operators will set reserved space for daemon processes on each host; we found it sufficient in our case, and our major tuning focuses on the bursty factor value
>>>>>>>>>>>>
>>>>>>>>>>>> > Have you considered scheduling optimizations to ensure a strategic mix of executors with large S and small S values on a single node? I am wondering if this would reduce the probability of concurrent bursting and host-level OOM.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, when we worked on this project we paid some attention to the cluster scheduling policy/behavior... two things we mostly care about:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. as stated in the SPIP doc, the cluster should have a certain level of diversity of workloads so that we have enough candidates to form a mixed set of executors with large S and small S values
>>>>>>>>>>>>
>>>>>>>>>>>> 2. we avoid using a bin-packing scheduling algorithm, which tends to pack more pods from the same job onto the same host and can create trouble, as they are more likely to ask for max memory at the same time
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for sharing this interesting proposal.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My initial understanding is that Kubernetes will use the Executor Memory Request (H + G) for scheduling decisions, which allows for better resource packing. I have a few questions regarding the shared portion S:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. How is the risk of host-level OOM mitigated when the total potential usage sum of H+G+S across all pods on a node exceeds its allocatable capacity? Does the proposal implicitly rely on the cluster operator to manually ensure an unrequested memory buffer exists on the node to serve as the shared pool?
>>>>>>>>>>>>> 2. Have you considered scheduling optimizations to ensure a strategic mix of executors with large S and small S values on a single node? I am wondering if this would reduce the probability of concurrent bursting and host-level OOM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think I'm still missing something in the big picture:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Is the memory overhead off-heap? The formula indicates a fixed heap size, and memory overhead can't be dynamic if it's on-heap.
>>>>>>>>>>>>>> - Do Spark applications have static profiles? When we submit stages, the cluster is already allocated, so how can we change anything?
>>>>>>>>>>>>>> - How do we assign the shared memory overhead? Fairly among all applications on the same physical node?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> we didn't separate the design into another doc since the main idea is relatively simple...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> for the request/limit calculation, I described it in Q4 of the SPIP doc https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> it is calculated per profile (you can say per stage): when the cluster manager composes the pod spec, it calculates the new memory overhead based on what the user asks for in that resource profile
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do we have a design sketch? How do we determine the memory request and limit? Is it per stage or per executor?
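Based on Nan's description that the overhead is recomputed per resource profile when the cluster manager composes the pod spec, the shape of that calculation might look like the sketch below. The case classes and helper are hypothetical and for illustration only; the real change lives in Spark's ResourceProfile / K8s backend code, which is not shown in the thread.

```scala
// Hypothetical stand-ins for the memory settings carried by a ResourceProfile;
// the actual SPIP change would read these from the profile's executor requests.
final case class ProfileMemory(profileId: Int, heapMiB: Long, overheadMiB: Long)
final case class PodMemorySpec(profileId: Int, requestMiB: Long, limitMiB: Long)

object PerProfilePodMemory {
  // The same request/limit split, applied independently to each resource profile
  // at the moment the K8s backend composes that profile's executor pod spec.
  def forProfile(p: ProfileMemory, burstyFactor: Double): PodMemorySpec = {
    val total  = p.heapMiB + p.overheadMiB
    val shared = math.min((total * (burstyFactor - 1.0)).toLong, p.overheadMiB)
    PodMemorySpec(p.profileId, requestMiB = total - shared, limitMiB = total)
  }

  def main(args: Array[String]): Unit = {
    // Two hypothetical stage-level profiles, e.g. the default one and a memory-heavy stage.
    val profiles = Seq(ProfileMemory(0, 40960, 10240), ProfileMemory(1, 8192, 4096))
    profiles.map(p => forProfile(p, burstyFactor = 1.2)).foreach(println)
  }
}
```

Under stage-level scheduling, each profile would get its own request/limit pair this way, while the limit always stays at the profile's full heap + overhead ask.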
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> yeah, the implementation is basically relying on the request/limit concept in K8S, ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> but if any other cluster manager comes along in the future, as long as it has a similar concept, it can leverage this easily, as the main logic is implemented in ResourceProfile
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This feature is only available on k8s because it allows containers to have dynamic resources?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Folks,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead allocation algorithm for Spark@K8S to improve the memory utilization of Spark clusters. Please see more details in the SPIP doc <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>. Feedback and discussion are welcome.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks Chao for being the shepherd of this feature. We also want to thank the authors of the original paper <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from ByteDance, specifically Rui ([email protected]) and Yixin ([email protected]).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>>>>>> Yao Wang
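Earlier in the thread, Nan notes that engines like Gluten rely on explicit off-heap memory (spark.memory.offHeap.size) rather than memoryOverhead, that the current implementation does not cover it, and that it would be an easy extension. The sketch below only illustrates how the same split could be extended to treat off-heap as bursty; it is not part of the proposed SPIP, and the function and parameter names are made up for illustration.

```scala
object OffHeapExtensionSketch {
  /**
   * Same request/limit split, but treating explicit off-heap memory
   * (spark.memory.offHeap.size) as bursty alongside memoryOverhead.
   * Purely illustrative; NOT part of the current SPIP implementation.
   */
  def requestAndLimit(heapMiB: Long, overheadMiB: Long, offHeapMiB: Long, burstyFactor: Double): (Long, Long) = {
    val total     = heapMiB + overheadMiB + offHeapMiB
    val burstyAsk = overheadMiB + offHeapMiB   // the portions whose usage tends to be bursty
    val shared    = math.min((total * (burstyFactor - 1.0)).toLong, burstyAsk)
    (total - shared, total)                    // (K8s request, K8s limit)
  }

  def main(args: Array[String]): Unit = {
    // A Gluten-style setup where off-heap dwarfs the JVM heap.
    val (request, limit) = requestAndLimit(heapMiB = 8192, overheadMiB = 2048, offHeapMiB = 20480, burstyFactor = 1.2)
    println(s"request = $request MiB, limit = $limit MiB") // 24576 MiB / 30720 MiB
  }
}
```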
