I have a server that does log processing, and I'm trying to improve its throughput. The bottleneck is access to a shared resource protected by a sync.Mutex. The issue is that even though there are nearly always worker goroutines blocked on Lock, only about 10% of the time (in aggregate) is spent actually running the code inside the critical section.
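Roughly, the hot path in each worker looks like this (heavily simplified; the types and names below are placeholders, not my actual code):

package main

import "sync"

// sharedState is a stand-in for the real shared resource; the actual
// fields and the update logic are elided.
type sharedState struct {
	counts map[string]int
}

var (
	mu     sync.Mutex
	shared = &sharedState{counts: make(map[string]int)}
)

// worker is the rough shape of each worker goroutine: some lock-free
// per-record work, then a short critical section on the shared mutex.
func worker(records <-chan string, wg *sync.WaitGroup) {
	defer wg.Done()
	for rec := range records {
		key := parse(rec) // per-record work done outside the lock

		mu.Lock()
		shared.counts[key]++ // stand-in for the real shared update
		mu.Unlock()
	}
}

func parse(rec string) string { return rec } // placeholder

func main() {
	records := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < 300; i++ { // ~300 workers, as in the real server
		wg.Add(1)
		go worker(records, &wg)
	}
	// ... feed log records into the channel here ...
	close(records)
	wg.Wait()
}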
To be concrete, I have 300 worker goroutines, and they collectively spend about 280 seconds per second blocked on the mutex, but only about 100 ms per second inside the critical section (measured from Lock returning to Unlock returning).

To investigate this, I added some instrumentation around sync.Mutex: I log an event when Lock is called ("call lock"), when Lock returns ("acquire lock"), and when Unlock returns ("unlock"). This shows that my code typically takes around 5-15 μs between acquiring and unlocking, but tens or hundreds of μs pass between one goroutine unlocking and the next goroutine acquiring the lock.

Here's a short section of the logged events. The number in square brackets is the goroutine ID; the timestamp is μs since the trace started.

[209] 73018.628 acquire lock (+38.798)
[209] 73028.707 unlock
[298] 73049.222 call lock
[199] 73068.748 call lock
[214] 73153.182 call lock
[355] 73259.473 call lock
[157] 73308.336 call lock
[256] 73339.965 call lock
[389] 73517.476 call lock
[325] 73541.077 acquire lock (+512.370)
[325] 73545.650 unlock
[396] 73581.819 call lock
[174] 73645.743 call lock
[224] 73708.319 call lock
[292] 73801.945 call lock
[285] 73855.717 call lock
[357] 73915.478 call lock
[276] 74013.148 acquire lock (+467.498)

Here you can see that goroutine 209 held the lock for only about 10 μs before returning from mu.Unlock at t=73028.707, but the next goroutine (325) didn't return from mu.Lock until t=73541.077, more than 500 μs later. Goroutine 325 then returned from mu.Unlock after about 5 μs, but it took another 467 μs for the next goroutine (276) to return from mu.Lock.
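For reference, the instrumentation is roughly a wrapper like this around sync.Mutex (a simplified sketch: the goroutine-ID hack via runtime.Stack and the inline Printf are just for illustration):

package main

import (
	"bytes"
	"fmt"
	"runtime"
	"strconv"
	"sync"
	"time"
)

var traceStart = time.Now()

// goid extracts the current goroutine's ID by parsing the header line of
// runtime.Stack output ("goroutine 123 [running]: ..."). Debugging hack
// only; it is far too slow to leave enabled in production.
func goid() int64 {
	buf := make([]byte, 64)
	buf = buf[:runtime.Stack(buf, false)]
	id, _ := strconv.ParseInt(string(bytes.Fields(buf)[1]), 10, 64)
	return id
}

// logEvent prints "[goroutine ID] <µs since trace start> <event>".
func logEvent(event string) {
	us := float64(time.Since(traceStart)) / float64(time.Microsecond)
	fmt.Printf("[%d] %.3f %s\n", goid(), us, event)
}

// tracedMutex wraps sync.Mutex and logs an event when Lock is called,
// when Lock returns ("acquire lock"), and when Unlock returns.
type tracedMutex struct {
	mu sync.Mutex
}

func (m *tracedMutex) Lock() {
	logEvent("call lock")
	m.mu.Lock()
	logEvent("acquire lock")
}

func (m *tracedMutex) Unlock() {
	m.mu.Unlock()
	logEvent("unlock")
}

func main() {
	var m tracedMutex
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m.Lock()
			time.Sleep(10 * time.Microsecond) // stand-in for the critical section
			m.Unlock()
		}()
	}
	wg.Wait()
}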
What is the best way to debug (and, hopefully, fix) this issue? I'm not really sure how to diagnose scheduling problems. I looked at GODEBUG=schedtrace=1000 output, but I'm not sure how to interpret it. Here's a few seconds of that output for this server:

2017/10/16 10:38:00 SCHED 253295ms: gomaxprocs=36 idleprocs=0 threads=56 spinningthreads=18 idlethreads=15 runqueue=1 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:01 SCHED 254305ms: gomaxprocs=36 idleprocs=1 threads=56 spinningthreads=18 idlethreads=14 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:02 SCHED 255307ms: gomaxprocs=36 idleprocs=2 threads=56 spinningthreads=18 idlethreads=15 runqueue=0 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:03 SCHED 256311ms: gomaxprocs=36 idleprocs=1 threads=56 spinningthreads=18 idlethreads=13 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
2017/10/16 10:38:04 SCHED 257313ms: gomaxprocs=36 idleprocs=1 threads=56 spinningthreads=17 idlethreads=13 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:05 SCHED 258318ms: gomaxprocs=36 idleprocs=1 threads=56 spinningthreads=18 idlethreads=15 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:06 SCHED 259318ms: gomaxprocs=36 idleprocs=3 threads=56 spinningthreads=17 idlethreads=17 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:07 SCHED 260323ms: gomaxprocs=36 idleprocs=1 threads=56 spinningthreads=18 idlethreads=14 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:08 SCHED 261324ms: gomaxprocs=36 idleprocs=2 threads=56 spinningthreads=18 idlethreads=15 runqueue=0 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:09 SCHED 262327ms: gomaxprocs=36 idleprocs=0 threads=56 spinningthreads=18 idlethreads=15 runqueue=1 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:10 SCHED 263333ms: gomaxprocs=36 idleprocs=1 threads=56 spinningthreads=18 idlethreads=14 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:11 SCHED 264333ms: gomaxprocs=36 idleprocs=4 threads=56 spinningthreads=18 idlethreads=15 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:12 SCHED 265336ms: gomaxprocs=36 idleprocs=3 threads=56 spinningthreads=19 idlethreads=14 runqueue=1 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
2017/10/16 10:38:13 SCHED 266339ms: gomaxprocs=36 idleprocs=0 threads=56 spinningthreads=14 idlethreads=15 runqueue=1 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Any ideas? Thanks in advance!

Caleb