Thanks for the good questions, @Wenchen Fan <[email protected]>. Please see the inline answers below.
> Is the memory overhead off-heap? The formula indicates a fixed heap size, and memory overhead can't be dynamic if it's on-heap.

Yes, the memory overhead is off-heap. The on-heap part in the formula is only used to calculate how much of the memoryOverhead is kept as guaranteed. For example, if we set the bursty factor to 1.2 and the user asks for 10G on-heap and 4G memoryOverhead, then the pod memory request and limit will be

R = 10 + (4 - min((10 + 4) * (1.2 - 1), 4)) = 11.2 G
L = 10 + 4 = 14 G

The only adjusted part is memoryOverhead; the on-heap size is always fixed. (A small Scala sketch of this calculation, and of the packing example below, is appended at the bottom of this mail, after the quoted thread.)

> Do Spark applications have static profiles? When we submit stages, the cluster is already allocated, how can we change anything?

Yes, the resource profiles have to be defined before the application starts, but with this approach we can use fewer resources to accommodate more jobs. For example, suppose we have a single-node cluster with 100G of RAM, two pods each requesting 40G + 10G (on-heap + memoryOverhead), and a bursty factor of 1.2. Without the mechanism proposed in this SPIP, we can host at most 2 such pods on this machine. With burst-aware memory allocation we only need 40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. we have 20G of free memory left on the machine which can be used to host more pods. Meanwhile, we still set the pod limit to 50G, so each original pod can still use up to 50G, under the assumption that pods rarely ask for their maximum amount of memory at the same time.

> How do we assign the shared memory overhead? Fairly among all applications on the same physical node?

Basically, the Spark scheduler asks for resources from the K8S scheduler by sending the adjusted request/limit values for each pod, and the K8S scheduler schedules the pod onto a host which has at least X memory available (where X is the request value).

On Mon, Dec 8, 2025 at 11:49 PM Wenchen Fan <[email protected]> wrote:

> I think I'm still missing something in the big picture:
>
>    - Is the memory overhead off-heap? The formula indicates a fixed heap
>    size, and memory overhead can't be dynamic if it's on-heap.
>    - Do Spark applications have static profiles? When we submit stages,
>    the cluster is already allocated, how can we change anything?
>    - How do we assign the shared memory overhead? Fairly among all
>    applications on the same physical node?
>
>
> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:
>
>> we didn't separate the design into another doc since the main idea is
>> relatively simple...
>>
>> for the request/limit calculation, I described it in Q4 of the SPIP doc
>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>
>> it is calculated per profile (you could also say per stage): when the
>> cluster manager composes the pod spec, it calculates the new memory
>> overhead based on what the user asks for in that resource profile
>>
>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]> wrote:
>>
>>> Do we have a design sketch? How to determine the memory request and
>>> limit? Is it per stage or per executor?
>>>
>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]> wrote:
>>>
>>>> yeah, the implementation basically relies on the request/limit
>>>> concept in K8S, ...
>>>>
>>>> but if there is any other cluster manager coming in the future, as long
>>>> as it has a similar concept, it can leverage this easily since the main
>>>> logic is implemented in ResourceProfile
>>>>
>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]> wrote:
>>>>
>>>>> This feature is only available on k8s because it allows containers to
>>>>> have dynamic resources?
>>>>>
>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:
>>>>>
>>>>>> Hi Folks,
>>>>>>
>>>>>> We are proposing a burst-aware memoryOverhead allocation algorithm
>>>>>> for Spark@K8S to improve the memory utilization of Spark clusters.
>>>>>> Please see more details in the SPIP doc
>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>> Feedback and discussion are welcome.
>>>>>>
>>>>>> Thanks Chao for being the shepherd of this feature.
>>>>>> Also want to thank the authors of the original paper
>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from ByteDance,
>>>>>> specifically Rui ([email protected]) and Yixin
>>>>>> ([email protected]).
>>>>>>
>>>>>> Thank you.
>>>>>> Yao Wang
>>>>>
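
P.S. For anyone who prefers code to prose, here is a minimal Scala sketch of the request/limit calculation from the first answer above. This is not the actual patch; the object name, method name and parameter names are purely illustrative.

object BurstAwareMemorySketch {

  /**
   * Burst-aware pod memory sizing as described above (all values in GiB).
   * Only the memoryOverhead portion is shrunk in the request; on-heap memory
   * is always fully guaranteed, and the limit stays at the original total.
   */
  def podMemory(onHeap: Double, memoryOverhead: Double, burstyFactor: Double): (Double, Double) = {
    val total = onHeap + memoryOverhead
    // Portion of memoryOverhead treated as shared/burstable rather than guaranteed.
    val shared = math.min(total * (burstyFactor - 1), memoryOverhead)
    val request = onHeap + (memoryOverhead - shared)
    val limit = total
    (request, limit)
  }

  def main(args: Array[String]): Unit = {
    // Example from the first answer: 10G on-heap + 4G memoryOverhead, bursty factor 1.2.
    val (request, limit) = podMemory(onHeap = 10, memoryOverhead = 4, burstyFactor = 1.2)
    println(f"request = $request%.1f G, limit = $limit%.1f G") // request = 11.2 G, limit = 14.0 G
  }
}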

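And here is the node-packing example from the second answer in the same sketch style. The 100G node and the 40G + 10G pods are the numbers used above; everything else is illustrative.

object NodePackingSketch {
  def main(args: Array[String]): Unit = {
    // Numbers from the second answer: a 100G node, pods asking for 40G on-heap
    // + 10G memoryOverhead, bursty factor 1.2.
    val nodeRam = 100.0
    val onHeap = 40.0
    val overhead = 10.0
    val burstyFactor = 1.2

    // Without the SPIP each pod requests the full 50G, so two pods fill the node.
    val plainRequest = onHeap + overhead
    // With burst-aware allocation the request shrinks to 40G while the limit stays at 50G.
    val shared = math.min((onHeap + overhead) * (burstyFactor - 1), overhead)
    val burstRequest = onHeap + (overhead - shared)

    val freeWithoutSpip = nodeRam - 2 * plainRequest
    val freeWithSpip = nodeRam - 2 * burstRequest
    println(f"free memory with two pods, without SPIP: $freeWithoutSpip%.0f G") // 0 G
    println(f"free memory with two pods, with SPIP:    $freeWithSpip%.0f G")    // 20 G
  }
}

The 20G left over can then be used by the K8S scheduler to place additional pods, while each of the two original pods can still burst up to its 50G limit.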