> One of the issues with stack growth checks is the increased code size, 
> which leads to higher L1i cache pressure.
 
Go generally isn't designed for computation-heavy workloads (e.g. matrix 
multiplication), and the compiler backend prioritizes compilation speed 
over the absolute performance of the generated code. In my experience, most 
Go server applications are front-end bound, with instruction-fetch stalls 
as a major bottleneck. That's why L1i performance is worth caring about.
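
To make the size cost concrete, here is a small sketch you can dump yourself 
(the file and function names are just placeholders, and the assembly in the 
comments is only the rough shape emitted for amd64 with the register ABI - 
the exact instructions and offsets vary by Go version and frame size):

    // stackcheck.go - dump the generated code with:
    //
    //     go build -gcflags=-S stackcheck.go
    //
    // For a non-leaf function like work below, the prologue looks roughly like:
    //
    //     CMPQ SP, 16(R14)      // compare SP against g.stackguard0 (g lives in R14)
    //     JLS  <grow>           // not enough room: jump to the growth stub
    //     ...                   // function body
    //   <grow>:
    //     CALL runtime.morestack_noctxt(SB)
    //     JMP  <start>          // retry the function once the stack has grown
    package main

    // leafSum makes no calls; if I read the toolchain right, such small
    // leaf functions usually get the stack check elided entirely.
    //go:noinline
    func leafSum(n int) int {
        s := 0
        for i := 0; i < n; i++ {
            s += i
        }
        return s
    }

    // work calls other functions, so it is not a leaf and keeps the
    // stack-growth check in its prologue.
    //go:noinline
    func work(n int) int {
        return leafSum(n) + leafSum(n/2)
    }

    func main() {
        println(work(1 << 10))
    }

Comparing the two listings gives a feel for how much of a small function's 
footprint the check (plus the out-of-line morestack stub) takes up.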
On Tuesday, 4 November 2025 at 12:12:47 UTC+3 Arseny Samoylov wrote:

>
> > Remember it is no longer the number of conditionals or reads but
> > rather the number of L1 cache misses and writes that cause
> > multi-core cache coherency stalls that dominate performance
> > typically today.
>
> You are right. One of the issues with stack growth checks is the increased 
> code size, which leads to higher L1i cache pressure. 
> Each check takes roughly 10 instructions: load the end address of the 
> stack, compute the remaining space, branch if insufficient, spill registers 
> (in Go's ABI all registers are caller-saved, so this can be up to 16 x 2 
> push/pops for spill/fill), call runtime.morestack, fill the registers, and 
> retry. 
> If a medium-sized function consists of about 100-200 instructions, this 
> amounts to roughly 5%-10% code-size overhead.
>
> > Even more importantly, signal handling is very slow, and often "late".
>
> Absolutely - I agree. The idea here is to reserve a large, lazily 
> allocated stack (e.g. 8 MB) so that we almost never hit the limit.
> The page-fault-based reallocation would only serve as a safety mechanism - 
> ensuring that, if a goroutine stack ever does reach its limit, it can 
> either be reallocated safely or cause a controlled panic.
> On Tuesday, 4 November 2025 at 02:03:51 UTC+3 Jason E. Aten wrote:
>
>> Hi Arseny,
>>
>> Remember it is no longer the number of conditionals or reads but
>> rather the number of L1 cache misses and writes that cause
>> multi-core cache coherency stalls that dominate performance
>> typically today.
>>
>> Even more importantly, signal handling is very slow, and often "late".
>>
>> You would have to 
>> 1) trap the page fault and context switch to the kernel, 
>> 2) context switch back to the signal handler,
>> 3) context switch back to the kernel to allocate a page or change its 
>> protections, 
>> and then 
>> 4) context switch back to the original faulting code. 
>>
>> You are looking at at least 3 but usually 4 context switches; this will be 
>> much, much slower than using the Go allocator.
>>
>>
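
P.S. As a toy illustration of the guard-page fallback (and of the kernel 
round trips described above), here is a Linux-only user-space sketch in Go. 
It is emphatically not how the runtime would implement stack growth - it 
turns the fault into a recoverable panic via debug.SetPanicOnFault instead 
of fixing the mapping up inside a signal handler and resuming - but it walks 
through the reserve / trap / remap / retry cycle end to end:

    // faultgrow.go - toy demo of "reserve big, back lazily, trap on overflow".
    package main

    import (
        "fmt"
        "runtime/debug"
        "syscall"
    )

    // touch writes one byte into p and reports whether the write faulted.
    // debug.SetPanicOnFault turns the SIGSEGV on the inaccessible mapping
    // into a recoverable panic instead of crashing the process; we do not
    // resume the faulting instruction, we simply retry the call afterwards.
    func touch(p []byte) (faulted bool) {
        defer func() {
            if recover() != nil {
                faulted = true
            }
        }()
        debug.SetPanicOnFault(true)
        defer debug.SetPanicOnFault(false)
        p[0] = 1
        return false
    }

    func main() {
        // Reserve 8 MB of address space with no access permissions:
        // cheap to reserve, lazily backed, and any touch traps.
        const size = 8 << 20
        region, err := syscall.Mmap(-1, 0, size,
            syscall.PROT_NONE, syscall.MAP_PRIVATE|syscall.MAP_ANON)
        if err != nil {
            panic(err)
        }
        defer syscall.Munmap(region)

        fmt.Println("first touch faulted:", touch(region)) // true

        // "Grow" the usable area by flipping the protection, then retry.
        err = syscall.Mprotect(region, syscall.PROT_READ|syscall.PROT_WRITE)
        if err != nil {
            panic(err)
        }
        fmt.Println("touch after mprotect faulted:", touch(region)) // false
    }

Timing the faulting path against the plain write gives a rough feel for how 
much slower it is, which is exactly why it should only ever be the rare 
safety net rather than the common case.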
