Hello, everyone. I'm new to NUMA and to ARM synchronization primitives such 
as the DMB ST/SY/... instructions, but I want to do some performance 
analysis and build runtime optimizations or a compiler transformation pass 
for these scenarios. Although I know the names, I don't yet have insight 
into these concepts and can't work out the research steps needed to find 
good optimization targets.

I have two ideas based on my (admittedly naive) understanding of NUMA, 
synchronization primitives, and the Go runtime, especially its memory 
allocator.

*Idea 1: allocate objects in the mcache (per-P local cache) corresponding 
to where they are frequently accessed.*

The motivating example is as follows:

```
func capture_by_ref() {
    const N = 1_000_000
    // x escapes to the heap because the closure captures it by reference.
    x := 0
    go func() {
        for range N { // Go 1.22+ range-over-int
            x++
        }
    }()
    _ = x // note: any concurrent access from the parent would be a data race
}
```

The pseudo-IR/SSA can be:

```
capture_by_ref:
    px = newobject(int)
    closure = &funcval{thunk, px}
    newproc(closure) // just pseudo-code, may not be correct
```

Here `x` is allocated from the mcache (local cache) of the P running the 
"main" goroutine, but it is frequently accessed by the `go func` goroutine, 
which may be scheduled on a P bound to a different NUMA node. This may 
cause large access latency on NUMA machines. *I'm not sure: the CPU cache 
may hide most of the latency, but the first access from the remote node 
(and later coherence misses) could still be expensive. I have no real 
insight here and am just guessing.*
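A minimal sketch of how one might test that guess is below. The names 
`iters`, `localAlloc`, `remoteAlloc`, and `sink` are my own invention; note 
that without pinning (e.g. `numactl` on Linux) both goroutines may land on 
the same node, so any measured difference mostly reflects cache migration 
rather than true NUMA distance:

```
package main

import (
    "fmt"
    "time"
)

const iters = 10_000_000

// sink forces x in localAlloc to escape to the heap, so both variants
// actually allocate through the runtime allocator.
var sink *int

// localAlloc: the worker goroutine allocates its own counter and
// increments it, so allocation and access happen on the same P.
func localAlloc() time.Duration {
    done := make(chan time.Duration)
    go func() {
        x := new(int) // allocated by the worker itself
        start := time.Now()
        for i := 0; i < iters; i++ {
            *x++
        }
        sink = x
        done <- time.Since(start)
    }()
    return <-done
}

// remoteAlloc: the spawner allocates, the worker increments --
// the capture-by-reference pattern from the example above.
func remoteAlloc() time.Duration {
    x := new(int) // allocated by the spawning goroutine
    done := make(chan time.Duration)
    go func() {
        start := time.Now()
        for i := 0; i < iters; i++ {
            *x++
        }
        done <- time.Since(start)
    }()
    return <-done
}

func main() {
    fmt.Println("local :", localAlloc())
    fmt.Println("remote:", remoteAlloc())
}
```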

If the latency assumption is correct (or can be verified), the proposed 
solution is as follows:
1. Find objects matching this pattern, i.e. allocated in one goroutine but 
passed to another goroutine that accesses them much more frequently than 
the original one. This could be done through static analysis (similar to 
escape analysis) or dynamic analysis (e.g. thread-sanitizer-style 
instrumentation with the runtime replaced by an access counter; I have 
tried this with C++ TSan, but not in Go, and I'm not familiar with Go's 
race detector internals).
2. Transform the code to the following pattern:

```
capture_by_ref:
    px = newobject_remote(int, nextprocid()) // nextprocid helps the
                                             // allocator pick the
                                             // corresponding mcache
    closure = &funcval{thunk, px}
    newproc(closure) // newproc creates the new goroutine bound to that
                     // mcache and P
```
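As a source-level approximation of this transformation that needs no 
allocator changes, one can move the allocation into the spawned goroutine 
and hand the result back over a channel. This is only a hypothetical sketch 
of mine: today's Go runtime is not NUMA-aware and mcaches are per-P, so 
this merely changes *which* P's mcache serves the allocation:

```
package main

import "fmt"

const N = 1_000_000

// capture_by_value moves the counter into the worker goroutine, so it is
// allocated (and stays hot) wherever the worker runs; the spawner only
// receives the final value over a channel.
func capture_by_value() int {
    result := make(chan int)
    go func() {
        x := 0 // owned by the worker, not the spawner
        for i := 0; i < N; i++ {
            x++
        }
        result <- x
    }()
    return <-result
}

func main() {
    fmt.Println(capture_by_value()) // prints 1000000
}
```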

*Idea 2: this is based on issue #63640, "runtime: mallocgc does not seem 
to need to call publicationBarrier when allocating noscan objects" 
<https://github.com/golang/go/issues/63640>.*

Is there a simple test case that can illustrate the optimization effect of 
removing the DMB ST? And are there other variables (like GC) I should 
control for to make the measurement more convincing?
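As a starting point, a micro-benchmark that hammers the noscan allocation 
path might look like the sketch below (`sink` and the 64-byte size are 
arbitrary choices of mine). To address the GC variable, one could run it 
with `GOGC=off` and compare toolchains with and without the barrier:

```
package main

import (
    "fmt"
    "testing"
)

// sink keeps each allocation observable so the compiler cannot elide it.
var sink []byte

// BenchmarkNoscanAlloc stresses mallocgc with pointer-free (noscan)
// allocations -- a []byte backing array contains no pointers -- which is
// the path where issue #63640 suggests the publicationBarrier (a DMB ST
// on arm64) might be unnecessary.
func BenchmarkNoscanAlloc(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink = make([]byte, 64)
    }
}

func main() {
    // testing.Benchmark lets us run the benchmark without `go test`.
    fmt.Println(testing.Benchmark(BenchmarkNoscanAlloc))
}
```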

(Sorry, the second idea is still vaguely expressed, which is why I'm 
asking for guidance here.)

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion visit 
https://groups.google.com/d/msgid/golang-nuts/62a55625-7eeb-4f28-b142-f978f8bb58dfn%40googlegroups.com.
