Thanks for the reply. Have you tested in environments where O is bigger than H? I'm wondering whether the proposed algorithm would help more in those environments (e.g., with native accelerators).
On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]> wrote:

> Hi Qiegang, thanks for the good questions as well. Please see the answers below.
>
>> My initial understanding is that Kubernetes will use the Executor Memory Request (H + G) for scheduling decisions, which allows for better resource packing.
>
> Yes, your understanding is correct.
>
>> How is the risk of host-level OOM mitigated when the total potential usage sum of H+G+S across all pods on a node exceeds its allocatable capacity? Does the proposal implicitly rely on the cluster operator to manually ensure an unrequested memory buffer exists on the node to serve as the shared pool?
>
> In PINS we apply a set of strategies: a conservative bursty factor, a progressive rollout, and monitoring of cluster metrics such as Linux kernel OOM-killer occurrences, which guide us toward the optimal bursty-factor setting. Usually, K8s operators reserve space on each host for daemon processes; we found that sufficient in our case, and our major tuning effort focuses on the bursty-factor value.
>
>> Have you considered scheduling optimizations to ensure a strategic mix of executors with large S and small S values on a single node? I am wondering if this would reduce the probability of concurrent bursting and host-level OOM.
>
> Yes, while working on this project we paid attention to the cluster's scheduling policy and behavior. Two things we care about most:
>
> 1. As stated in the SPIP doc, the cluster should have a certain level of workload diversity so that there are enough candidates to form a mixed set of executors with large and small S values.
> 2. We avoid bin-packing scheduling algorithms, which tend to place more pods from the same job on the same host; such pods are more likely to ask for their maximum memory at the same time, which can cause trouble.
>
> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:
>
>> Thanks for sharing this interesting proposal.
>>
>> My initial understanding is that Kubernetes will use the Executor Memory Request (H + G) for scheduling decisions, which allows for better resource packing. I have a few questions regarding the shared portion S:
>>
>> 1. How is the risk of host-level OOM mitigated when the total potential usage sum of H+G+S across all pods on a node exceeds its allocatable capacity? Does the proposal implicitly rely on the cluster operator to manually ensure an unrequested memory buffer exists on the node to serve as the shared pool?
>> 2. Have you considered scheduling optimizations to ensure a strategic mix of executors with large S and small S values on a single node? I am wondering if this would reduce the probability of concurrent bursting and host-level OOM.
>>
>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]> wrote:
>>
>>> I think I'm still missing something in the big picture:
>>>
>>> - Is the memory overhead off-heap? The formula indicates a fixed heap size, and memory overhead can't be dynamic if it's on-heap.
>>> - Do Spark applications have static profiles? When we submit stages, the cluster is already allocated, so how can we change anything?
>>> - How do we assign the shared memory overhead? Fairly among all applications on the same physical node?
>>>
>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:
>>>
>>>> We didn't separate the design into another doc since the main idea is relatively simple.
>>>> For the request/limit calculation, I described it in Q4 of the SPIP doc:
>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>
>>>> It is calculated per resource profile (you could say per stage): when the cluster manager composes the pod spec, it calculates the new memory overhead based on what the user asks for in that resource profile.
>>>>
>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]> wrote:
>>>>
>>>>> Do we have a design sketch? How do we determine the memory request and limit? Is it per stage or per executor?
>>>>>
>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]> wrote:
>>>>>
>>>>>> Yeah, the implementation basically relies on the request/limit concept in K8s.
>>>>>>
>>>>>> But if any other cluster manager comes along in the future, it can leverage this easily as long as it has a similar concept, since the main logic is implemented in ResourceProfile.
>>>>>>
>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]> wrote:
>>>>>>
>>>>>>> This feature is only available on K8s because it allows containers to have dynamic resources?
>>>>>>>
>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi folks,
>>>>>>>>
>>>>>>>> We are proposing a burst-aware memoryOverhead allocation algorithm for Spark@K8S to improve the memory utilization of Spark clusters. Please see more details in the SPIP doc
>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>>>> Feedback and discussion are welcome.
>>>>>>>>
>>>>>>>> Thanks to Chao for being the shepherd of this feature. We also want to thank the authors of the original paper
>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf>
>>>>>>>> from ByteDance, specifically Rui ([email protected]) and Yixin ([email protected]).
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>> Yao Wang
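
[Editor's note] To make the request/limit split discussed in the thread above concrete, here is a minimal, self-contained Scala sketch of one way the configured memory overhead could be divided into a guaranteed slice (G) and a shared, bursty slice (S). The burstyFactor knob, the split helper, and the division formula are illustrative assumptions, not the formula defined in Q4 of the SPIP doc; only the overall shape, request = H + G and limit = H + G + S, is taken from the discussion.

    object BurstAwareOverheadSketch {

      /** Memory request/limit pair for one executor pod, in MiB. */
      final case class OverheadSplit(requestMiB: Long, limitMiB: Long)

      /**
       * heapMiB      - H: executor heap (spark.executor.memory)
       * overheadMiB  - O: configured memory overhead (spark.executor.memoryOverhead)
       * burstyFactor - assumed knob for how much of O is treated as bursty;
       *                1.0 means everything is guaranteed (today's behavior)
       */
      def split(heapMiB: Long, overheadMiB: Long, burstyFactor: Double): OverheadSplit = {
        require(burstyFactor >= 1.0, "bursty factor must be >= 1.0")
        // G: guaranteed slice of the overhead, always counted in the pod request
        val guaranteedMiB = math.ceil(overheadMiB / burstyFactor).toLong
        // S: shared/bursty slice, reserved only through the pod limit
        val sharedMiB = overheadMiB - guaranteedMiB
        OverheadSplit(
          requestMiB = heapMiB + guaranteedMiB,             // H + G: what the scheduler packs on
          limitMiB = heapMiB + guaranteedMiB + sharedMiB    // H + G + S: hard cap per pod
        )
      }

      def main(args: Array[String]): Unit = {
        // Example: 8 GiB heap, 4 GiB overhead, bursty factor 2.0
        // -> request = 8192 + 2048 = 10240 MiB, limit = 12288 MiB
        val s = split(heapMiB = 8192L, overheadMiB = 4096L, burstyFactor = 2.0)
        println(s"request=${s.requestMiB}Mi limit=${s.limitMiB}Mi")
      }
    }

With these example numbers, the scheduler packs the node as if the executor needs 10 GiB, while the pod may still burst into another 2 GiB of shared headroom up to its 12 GiB limit.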

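[Editor's note] For completeness, a sketch of where such a pair could land in the executor container spec built for K8s (Spark's K8s backend uses the fabric8 client). The builder calls below are standard fabric8 APIs, but this is only an illustration of the request/limit mapping under the assumptions above, not the SPIP's actual integration point, and the container name is just the conventional executor container name.

    import io.fabric8.kubernetes.api.model.{Container, ContainerBuilder, Quantity}

    object ExecutorContainerSketch {

      // Attach the burst-aware memory figures to an executor container:
      // the scheduler packs nodes on the request, the cgroup cap is the limit.
      def withBurstAwareMemory(requestMiB: Long, limitMiB: Long): Container =
        new ContainerBuilder()
          .withName("spark-kubernetes-executor") // illustrative container name
          .withNewResources()
            .addToRequests("memory", new Quantity(s"${requestMiB}Mi")) // H + G
            .addToLimits("memory", new Quantity(s"${limitMiB}Mi"))     // H + G + S
          .endResources()
          .build()
    }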