> And now with inlining more common, functions tend to be bigger than they once were, so the amortized and actual cost are both reduced.
You are correct that inlining helps reduce the cost of stack-growth checks. However, interface method calls present a significant obstacle to inlining, and Go code uses them frequently. Also, the inlining budget without PGO is quite modest: one uninlined call already consumes most of the budget (57 out of 80, see https://github.com/golang/go/blob/master/src/cmd/compile/internal/inline/inl.go#L53).

> I have seen other systems trying to do this, and in most cases eventually
> abandoning it due to unforeseen complexity. Not saying it can't be done,
> but it's not easy.

Could you please provide some links? It would be very helpful, because I couldn't find any case studies of a GC'd language/runtime that tried to use page-fault-based stack growth for coroutines =(.

On Wednesday, 5 November 2025 at 08:19:53 UTC+3 Rob Pike wrote:

> I believe you are overstating the cost. It's measurable but not as severe
> as you state. And now with inlining more common, functions tend to be
> bigger than they once were, so the amortized and actual cost are both
> reduced.
>
> Moreover, using traps for stack growth has been problematic in the past. I
> have seen other systems trying to do this, and in most cases eventually
> abandoning it due to unforeseen complexity. Not saying it can't be done,
> but it's not easy. Plus it is hard to do portably since the details will
> depend on the architecture and the operating system.
>
> It's not up to me, but I wouldn't do this.
>
> -rob
>
>
> On Tue, Nov 4, 2025 at 8:47 PM Arseny Samoylov <[email protected]> wrote:
>
>> > One of the issues with stack growth checks is the increased code size,
>> > which leads to higher L1i cache pressure.
>>
>> Go generally isn't designed for computation-heavy workloads (e.g. matrix
>> multiplication), and the compiler backend prioritizes compilation speed
>> over the absolute performance of the generated code.
>> In my experience, most Go server applications are front-end bound, and
>> front-end stalls tend to be a major bottleneck. That's why we should
>> care about L1i performance.
>>
>> On Tuesday, 4 November 2025 at 12:12:47 UTC+3 Arseny Samoylov wrote:
>>
>>> > Remember it is no longer the number of conditionals or reads but
>>> > rather the number of L1 cache misses and writes that cause
>>> > multi-core cache coherency stalls that dominate performance
>>> > typically today.
>>>
>>> You are right. One of the issues with stack growth checks is the
>>> increased code size, which leads to higher L1i cache pressure.
>>> Each check takes roughly 10 instructions: load the end address of the
>>> stack, compute the remaining space, branch if insufficient, spill
>>> registers (in Go's ABI all registers are caller-saved, so this can be
>>> up to 16 x 2 push/pops for spill/fill), call runtime.morestack, fill
>>> the registers, and retry.
>>> If a medium-sized function consists of about 100-200 instructions,
>>> this amounts to roughly 5-10% code size overhead.
>>>
>>> > Even more importantly, signal handling is very slow, and often "late".
>>>
>>> Absolutely - I agree. The idea here is to reserve a large, lazily
>>> allocated stack (e.g. 8 MB) so that we almost never hit the limit.
>>> The page-fault-based reallocation would serve only as a safety
>>> mechanism - ensuring that, if a goroutine stack ever does reach its
>>> limit, it can either be reallocated safely or cause a controlled panic.
>>>
>>> On Tuesday, 4 November 2025 at 02:03:51 UTC+3 Jason E. Aten wrote:
>>>
>>>> Hi Arseny,
>>>>
>>>> Remember it is no longer the number of conditionals or reads but
>>>> rather the number of L1 cache misses and writes that cause
>>>> multi-core cache coherency stalls that dominate performance
>>>> typically today.
>>>>
>>>> Even more importantly, signal handling is very slow, and often "late".
>>>>
>>>> You would have to
>>>> 1) trap the page fault and context switch to the kernel,
>>>> 2) context switch back to the signal handler,
>>>> 3) context switch back to the kernel to allocate a page or change its
>>>> protections, and then
>>>> 4) context switch back to the original faulting code.
>>>>
>>>> You are looking at at least 3 but usually 4 context switches; this
>>>> will be much, much slower than using the Go allocator.
>>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "golang-nuts" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion visit
>> https://groups.google.com/d/msgid/golang-nuts/27b84849-e863-4d5c-97e7-1da3a04cc87fn%40googlegroups.com.
>

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion visit https://groups.google.com/d/msgid/golang-nuts/67275995-4a13-4302-8133-9bfe9ce91a96n%40googlegroups.com.
