I have a server that does log processing, and I'm trying to improve
its throughput. The bottleneck is access to a shared resource
protected by a sync.Mutex. The puzzling part is that even though
there are nearly always worker goroutines blocked in Lock, only about
10% of the time, in aggregate, is spent running the code inside the
critical section.

To be concrete, I have 300 worker goroutines, and they collectively
spend about 280 seconds per second blocked on the mutex, but only
about 100 ms per second inside the critical section (measured from
Lock returning to Unlock returning).
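
For context, the workers all have roughly this shape (a simplified
sketch, not the real code; the names here are made up):

    import (
        "strings"
        "sync"
    )

    var (
        mu     sync.Mutex         // the contended mutex
        shared = map[string]int{} // stand-in for the shared resource
    )

    // process stands in for the real per-item work, which happens
    // outside the lock.
    func process(s string) string { return strings.ToLower(s) }

    func worker(items <-chan string) {
        for item := range items {
            key := process(item) // bulk of the work, no lock held
            mu.Lock()
            shared[key]++ // short critical section
            mu.Unlock()
        }
    }

    // startWorkers launches the 300 goroutines that all contend on mu.
    func startWorkers(items <-chan string) {
        for i := 0; i < 300; i++ {
            go worker(items)
        }
    }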

To investigate this, I added some instrumentation around sync.Mutex: I
log an event when Lock is called, when Lock returns (i.e. the lock is
acquired), and when Unlock returns. This shows that my code typically
takes around 5-15 μs between acquiring and releasing the lock, but
that tens or hundreds of μs pass between one goroutine's unlock and
the next goroutine's acquire.
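
The instrumentation is just a thin wrapper around sync.Mutex, roughly
like this (a simplified sketch, not my exact code):

    import (
        "fmt"
        "runtime"
        "strconv"
        "strings"
        "sync"
        "time"
    )

    var traceStart = time.Now()

    type tracedMutex struct{ mu sync.Mutex }

    func (m *tracedMutex) Lock() {
        logEvent("call lock") // about to block
        m.mu.Lock()
        logEvent("acquire lock") // Lock returned; we now hold the mutex
    }

    func (m *tracedMutex) Unlock() {
        m.mu.Unlock()
        logEvent("unlock") // after Unlock returns
    }

    // logEvent records the goroutine ID and a μs timestamp relative to
    // the start of the trace. In practice you'd buffer these events
    // rather than print inline, so logging doesn't perturb the timings.
    func logEvent(what string) {
        us := float64(time.Since(traceStart)) / float64(time.Microsecond)
        fmt.Printf("[%d] %12.3f   %s\n", goid(), us, what)
    }

    // goid extracts the current goroutine ID from runtime.Stack output.
    // There's no supported API for this; it's debug-only.
    func goid() int {
        buf := make([]byte, 64)
        n := runtime.Stack(buf, false)
        // The first line looks like "goroutine 123 [running]:".
        fields := strings.Fields(string(buf[:n]))
        id, _ := strconv.Atoi(fields[1])
        return id
    }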

Here's a short section of logged events. The number in square brackets
is the goroutine ID; the timestamp is μs since the trace started; the
number in parentheses on each "acquire lock" line is the μs elapsed
since the previous unlock.

[209]    73018.628   acquire lock (+38.798)
[209]    73028.707   unlock
[298]    73049.222   call lock
[199]    73068.748   call lock
[214]    73153.182   call lock
[355]    73259.473   call lock
[157]    73308.336   call lock
[256]    73339.965   call lock
[389]    73517.476   call lock
[325]    73541.077   acquire lock (+512.370)
[325]    73545.650   unlock
[396]    73581.819   call lock
[174]    73645.743   call lock
[224]    73708.319   call lock
[292]    73801.945   call lock
[285]    73855.717   call lock
[357]    73915.478   call lock
[276]    74013.148   acquire lock (+467.498)

Here you can see that goroutine 209 held the lock for only about 10 μs
before returning from mu.Unlock at t=73028.707, but the next goroutine
(325) didn't return from mu.Lock until t=73541.077, more than 500 μs
later. Goroutine 325 in turn returned from mu.Unlock after about 5 μs,
but it then took another 467 μs for the next goroutine (276) to return
from mu.Lock.

What is the best way to debug (and, hopefully, fix) this issue? I'm
not really sure how to diagnose scheduling problems. I looked at
GODEBUG=schedtrace=1000 output, but I'm not sure how to interpret it.
Here are a few seconds of that output for this server:

2017/10/16 10:38:00 SCHED 253295ms: gomaxprocs=36 idleprocs=0
threads=56 spinningthreads=18 idlethreads=15 runqueue=1 [0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:01 SCHED 254305ms: gomaxprocs=36 idleprocs=1
threads=56 spinningthreads=18 idlethreads=14 runqueue=0 [0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:02 SCHED 255307ms: gomaxprocs=36 idleprocs=2
threads=56 spinningthreads=18 idlethreads=15 runqueue=0 [0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:03 SCHED 256311ms: gomaxprocs=36 idleprocs=1
threads=56 spinningthreads=18 idlethreads=13 runqueue=0 [0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
2017/10/16 10:38:04 SCHED 257313ms: gomaxprocs=36 idleprocs=1
threads=56 spinningthreads=17 idlethreads=13 runqueue=0 [0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:05 SCHED 258318ms: gomaxprocs=36 idleprocs=1
threads=56 spinningthreads=18 idlethreads=15 runqueue=0 [0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:06 SCHED 259318ms: gomaxprocs=36 idleprocs=3
threads=56 spinningthreads=17 idlethreads=17 runqueue=0 [0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:07 SCHED 260323ms: gomaxprocs=36 idleprocs=1
threads=56 spinningthreads=18 idlethreads=14 runqueue=0 [0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:08 SCHED 261324ms: gomaxprocs=36 idleprocs=2
threads=56 spinningthreads=18 idlethreads=15 runqueue=0 [0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:09 SCHED 262327ms: gomaxprocs=36 idleprocs=0
threads=56 spinningthreads=18 idlethreads=15 runqueue=1 [0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:10 SCHED 263333ms: gomaxprocs=36 idleprocs=1
threads=56 spinningthreads=18 idlethreads=14 runqueue=0 [0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:11 SCHED 264333ms: gomaxprocs=36 idleprocs=4
threads=56 spinningthreads=18 idlethreads=15 runqueue=0 [0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:12 SCHED 265336ms: gomaxprocs=36 idleprocs=3
threads=56 spinningthreads=19 idlethreads=14 runqueue=1 [0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:13 SCHED 266339ms: gomaxprocs=36 idleprocs=0
threads=56 spinningthreads=14 idlethreads=15 runqueue=1 [0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Any ideas?

Thanks in advance!
Caleb
