Hello, everyone. I'm new to NUMA and to ARM synchronization primitives such
as the DMB ST/SY/... instructions, but I want to do some performance
analysis and build runtime optimizations or a compiler transformation pass
for these scenarios. Although I know the names, I lack insight into these
concepts and can't work out the research steps needed to find optimization
targets.
I have two ideas based on my naive understanding of NUMA, synchronization
primitives, and the Go runtime, especially memory allocation.
*idea 1: allocate objects in the corresponding mcache (local cache) where
they are frequently accessed.*
The motivating example is as follows:
```
const N = 1 << 20

func capture_by_ref() {
	// x escapes to the heap because the closure captures it by reference
	x := 0
	go func() {
		for range N { // range-over-int requires Go 1.22+
			x++
		}
	}()
	_ = x
}
```
The pseudo-IR/SSA can be:
```
capture_by_ref:
    px = newobject(int)
    closure = &funcval{thunk, px}
    newproc(closure) // pseudo-code, may not be exact
```
Here `x` is allocated from the mcache (local cache) of the "main"
goroutine's P, but it is accessed far more often from the `go func`
goroutine, which may run on a different P. On a NUMA machine this may mean
large access latency. *I'm not sure: hardware caching may hide most of the
latency, but the first access from a remote node could still be expensive.
I have no insight here and am just guessing.*
If the assumption about access latency is correct (or can be verified), the
proposed solution is as follows:
1. Find objects in this pattern, i.e. allocated in one goroutine but passed
to another goroutine and accessed much more frequently there than in the
original goroutine, via static analysis (like escape analysis) or dynamic
analysis (like a thread sanitizer with its runtime replaced by an access
counter; I have tried this with C++ tsan, but not in Go, and I'm not
familiar with Go's tsan).
2. Transform the code into the following pattern:
```
capture_by_ref:
    // nextprocid assists the allocator in choosing the target mcache
    px = newobject_remote(int, nextprocid())
    closure = &funcval{thunk, px}
    // newproc creates the new goroutine bound to that mcache and P
    newproc(closure)
```
*idea 2: this is based on issue #63640*, "runtime: mallocgc does not seem
to need to call publicationBarrier when allocating noscan objects"
<https://github.com/golang/go/issues/63640>.
Is there a simple test case that can illustrate the optimization effect of
removing the DMB ST? And are there other variables, such as GC, that should
be controlled to make the optimization effect more convincing?
(Sorry, the second idea is expressed unclearly, which is why I'm asking for
guidance here.)
--
You received this message because you are subscribed to the Google Groups
"golang-nuts" group.