> One of the issues with stack growth checks is the increased code size, which leads to higher L1i cache pressure.

Go generally isn't designed for computation-heavy workloads (e.g. matrix multiplication), and the compiler backend prioritizes compilation speed over the absolute performance of the generated code. In my experience, most Go server applications are front-end bound, and front-end stalls tend to be a major bottleneck. That's why we should care about L1i performance.

On Tuesday, 4 November 2025 at 12:12:47 UTC+3 Arseny Samoylov wrote:
> >> Remember it is no longer the number of conditionals or reads but
> >> rather the number of L1 cache misses and writes that cause
> >> multi-core cache coherency stalls that dominate performance
> >> typically today.
>
> You are right. One of the issues with stack growth checks is the increased
> code size, which leads to higher L1i cache pressure.
> Each check takes roughly 10 instructions: load the end address of the
> stack, compute the remaining space, branch if insufficient, spill
> registers (in Go's ABI all registers are caller-saved, so this can be up
> to 16 x 2 push/pops for spill/fill), call runtime.morestack, fill the
> registers, and retry.
> If a medium-sized function consists of about 100-200 instructions, this
> represents roughly 5-10% code size overhead.
>
> >> Even more importantly, signal handling is very slow, and often "late".
>
> Absolutely - I agree. The idea here is to reserve a large, lazily
> allocated stack (e.g. 8 MB) so that we almost never hit the limit.
> The page-fault-based reallocation would only serve as a safety mechanism -
> ensuring that, if a goroutine stack ever does reach its limit, it can
> either be reallocated safely or cause a controlled panic.
>
> On Tuesday, 4 November 2025 at 02:03:51 UTC+3 Jason E. Aten wrote:
>
>> Hi Arseny,
>>
>> Remember it is no longer the number of conditionals or reads but
>> rather the number of L1 cache misses and writes that cause
>> multi-core cache coherency stalls that dominate performance
>> typically today.
>>
>> Even more importantly, signal handling is very slow, and often "late".
>>
>> You would have to
>> 1) trap the page fault and context switch to the kernel,
>> 2) context switch back to the signal handler,
>> 3) context switch back to the kernel to allocate a page or change its
>> protections,
>> and then
>> 4) context switch back to the original faulting code.
>> You are looking at at least 3, but usually 4, context switches; this
>> will be much, much slower than using the Go allocator.

-- 
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion visit https://groups.google.com/d/msgid/golang-nuts/27b84849-e863-4d5c-97e7-1da3a04cc87fn%40googlegroups.com.
