I think it would be more helpful if you used gctrace=1 to report on the GC activity.
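To enable that alongside the schedtrace settings you already have, something like this (assuming you can set the environment on the container's entrypoint; the binary name is just a placeholder):

    GODEBUG=gctrace=1,schedtrace=1000,scheddetail=1 ./your-app

gctrace=1 prints one line to stderr per completed GC, with the wall-clock and CPU times broken out for the sweep-termination, concurrent-mark, and mark-termination phases, so it should show directly which phase is absorbing the 5 to 28 seconds you're seeing.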
> On Dec 11, 2018, at 3:37 PM, e...@kasten.io wrote:
>
> I am observing pause failures in an application and it appears to me that I'm
> seeing it take from 5 to 28 seconds or more for the StopTheWorld to take
> effect for GC.
>
> In all likelihood the problem lies in the application, but before I start
> changing it or attempting to tune GC, I'd like to understand what's
> happening. I'd appreciate any assistance in decoding what I'm seeing in
> schedtrace output.
>
> Details are below but here are 4 consecutive samples at 1 second intervals.
> For brevity I've pruned out lines with 'M.* p=-1' and 'G.* status=[34]'
> (syscall or waiting in runtime).
>
> The first two are after we start the part of the workload that triggers the
> failure and before the expected health check arrives. The second two are as
> we try to StopTheWorld, which holds off the health check.
>
> SCHED 674246ms: gomaxprocs=4 idleprocs=4 threads=14 spinningthreads=0 idlethreads=4 runqueue=0 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
>   P0: status=0 schedtick=7112 syscalltick=1916 m=-1 runqsize=0 gfreecnt=6
>   P1: status=0 schedtick=3545 syscalltick=1452 m=-1 runqsize=0 gfreecnt=8
>   P2: status=0 schedtick=4549 syscalltick=1465 m=-1 runqsize=0 gfreecnt=7
>   P3: status=0 schedtick=1616 syscalltick=1223 m=-1 runqsize=0 gfreecnt=0
>   M7: p=1 curg=301 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
>
> ^^^ I suspect G301/M7 was in transition when sampled without locks or in a
> state I don't understand. For reference, it reported:
>   G301: status=3(IO wait) m=7 lockedm=-1
>
> SCHED 675256ms: gomaxprocs=4 idleprocs=2 threads=14 spinningthreads=0 idlethreads=3 runqueue=0 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
>   P0: status=0 schedtick=7118 syscalltick=1921 m=-1 runqsize=0 gfreecnt=6
>   P1: status=1 schedtick=3548 syscalltick=2656 m=7 runqsize=0 gfreecnt=8
>   P2: status=1 schedtick=4551 syscalltick=1465 m=2 runqsize=0 gfreecnt=7
>   P3: status=0 schedtick=1620 syscalltick=1225 m=-1 runqsize=0 gfreecnt=0
>   M7: p=1 curg=301 mallocing=1 throwing=0 preemptoff= locks=2 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
>   M2: p=2 curg=7 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
>   G7: status=2(GC worker (idle)) m=2 lockedm=-1
>   G301: status=2(IO wait) m=7 lockedm=-1
>
> ^^^ G301 is now running on M7/P1. M7 is marked as mallocing=1, locks=2.
> ^^^ G7 is a GC worker that is now running on M2/P2. M2 has locks=1.
>
> >>> Health check log expected here
>
> SCHED 676264ms: gomaxprocs=4 idleprocs=1 threads=14 spinningthreads=0 idlethreads=2 runqueue=0 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
>   P0: status=1 schedtick=7120 syscalltick=1927 m=6 runqsize=0 gfreecnt=5
>   P1: status=1 schedtick=3548 syscalltick=2656 m=7 runqsize=0 gfreecnt=8
>   P2: status=1 schedtick=4551 syscalltick=1465 m=2 runqsize=0 gfreecnt=7
>   P3: status=0 schedtick=1622 syscalltick=1229 m=-1 runqsize=0 gfreecnt=0
>   M7: p=1 curg=301 mallocing=1 throwing=0 preemptoff= locks=2 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
>   M6: p=0 curg=188 mallocing=1 throwing=0 preemptoff= locks=2 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
>   M2: p=2 curg=7 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
>   G7: status=2(GC worker (idle)) m=2 lockedm=-1
>   G301: status=2(IO wait) m=7 lockedm=-1
>   G188: status=2() m=6 lockedm=-1
>
> ^^^ A total of 3 Ps in use, 1 GC worker, 2 mallocing.
>
> SCHED 677271ms: gomaxprocs=4 idleprocs=0 threads=14 spinningthreads=0 idlethreads=3 runqueue=1 gcwaiting=1 nmidlelocked=1 stopwait=1 sysmonwait=0
>   P0: status=1 schedtick=7120 syscalltick=1927 m=6 runqsize=0 gfreecnt=5
>   P1: status=3 schedtick=3550 syscalltick=2656 m=7 runqsize=0 gfreecnt=8
>   P2: status=3 schedtick=4552 syscalltick=1465 m=-1 runqsize=0 gfreecnt=7
>   P3: status=3 schedtick=1622 syscalltick=1229 m=-1 runqsize=0 gfreecnt=0
>   M7: p=1 curg=6 mallocing=0 throwing=0 preemptoff=gcing locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
>   M6: p=0 curg=188 mallocing=1 throwing=0 preemptoff= locks=2 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
>   G6: status=2(GC worker (idle)) m=7 lockedm=-1
>   G301: status=1(IO wait) m=-1 lockedm=-1
>   G188: status=2() m=6 lockedm=-1
>
> ^^^ Someone has called stopTheWorldWithSema, either directly or via
> stopTheWorld. It hasn't returned, because stopwait is > 0.
> ^^^ The call is likely not via stopTheWorld, and the caller is likely the G6
> GC worker running on M7/P1 (stopTheWorldWithSema marks its own P as gcstop).
> ^^^ stopTheWorldWithSema is called from gcStart and gcMarkDone, so one of
> those is the caller.
> ^^^ stopTheWorldWithSema is poll-waiting for preemptall() to take effect
> (for P0/M6/G188 to notice the preemption request).
>
> For the next 4 samples there are no changes in P* or M*.
>
> After that we see a health check handler logged, and in the sample after that
> gcwaiting and stopwait are 0.
>
> What can G188/M6/P0 be doing such that it doesn't notice the request to
> preempt itself for 5 seconds? And while having mallocing=1?
>
> Thanks for any insights or suggestions,
>
> Eric
>
>
> =============
>
> Details:
>
> go version go1.11.1 linux/amd64
> System: Containerized Linux (Alpine)
> GOMAXPROCS=4
>
> The container reports NumCPU=1, so GOMAXPROCS defaults to 1. The problem
> occurs with both GOMAXPROCS=1 and GOMAXPROCS=4.
>
> The application is a fairly basic Kubernetes application with an HTTP server
> that services a health check, a separate goroutine that periodically makes an
> HTTP client request to poll for work, and a pool of worker goroutines that
> pick up work from a channel. The system is configured to issue a health check
> every 10 seconds. Intermittently during testing, the health check fails
> repeatedly and the app is restarted by Kubernetes.
>
> The test load is not particularly high; often the workers are idle and at
> most 2 are active at a time.
> The pause failure can be seen with only 1 request in flight at a time.
>
> I added a log message to the health check handler, set
> GODEBUG=schedtrace=1000,scheddetail=1, and reproduced the failure. When the
> system is running normally, we see 1 health check log message every 10
> seconds, with 10 SCHED traces in between each.
>
> In looking at the failure, I observed one instance of the health check being
> serviced 5+ seconds late, after which the system recovered, and then a set of
> late health checks where it was over 28 seconds late and where Kubernetes
> terminated the job.
>
> The extract above is from the 5+ second pause. I'm happy to share the fuller
> logs, and I also have runtime.MemStats dumped after the second failure. It
> reports PauseTotalNs of 39.7 seconds, with the last 3 pauses being 5.7s,
> 5.2s, and 28.6s.
>
> ========================
>
> Even more details:
>
> Might as well include MemStats:
>
> memstats
> "Alloc":35820624
> "TotalAlloc":87362776
> "Sys":72284408
> "Lookups":0
> "Mallocs":913917
> "Frees":732071
> "HeapAlloc":35820624
> "HeapSys":66027520
> "HeapIdle":25460736
> "HeapInuse":40566784
> "HeapReleased":0
> "HeapObjects":181846
> "StackInuse":1048576
> "StackSys":1048576
> "MSpanInuse":361000
> "MSpanSys":409600
> "MCacheInuse":6912
> "MCacheSys":16384
> "BuckHashSys":1495863
> "GCSys":2441216
> "OtherSys":845249
> "NextGC":42461456
> "LastGC":1544141699097608371
> "PauseTotalNs":39724621006
> "PauseNs":[121384
> 3326024
> 3534690
> 525590
> 54645
> 22208743
> 98898
> 250651
> 37238
> 2034238
> 33731
> 38616
> 41163
> 35875
> 37798
> 5794188053
> 5265243818
> 28632809851]
> "PauseEnd":[1544140960164525406
> 1544140960182034741
> 1544140960209848541
> 1544140960278986107
> 1544140960733914972
> 1544140961198224758
> 1544140961290430652
> 1544140961374230055
> 1544140961455867822
> 1544140961561961827
> 1544141081577752402
> 1544141201598823342
> 1544141321622343360
> 1544141441643055123
> 1544141561665242451
> 1544141599296615758
> 1544141641227906209
> 1544141699097608371]
> "NumGC":18
> "NumForcedGC":0
> "GCCPUFraction":0.013482378367869897
> "EnableGC":true
> "DebugGC":false
> "BySize":[
> {"Size":0,"Mallocs":0,"Frees":0}
> {"Size":8,"Mallocs":19587,"Frees":16601}
> {"Size":16,"Mallocs":329030,"Frees":254713}
> {"Size":32,"Mallocs":169516,"Frees":119550}
> {"Size":48,"Mallocs":89240,"Frees":73834}
> {"Size":64,"Mallocs":27676,"Frees":20629}
> {"Size":80,"Mallocs":19143,"Frees":13042}
> {"Size":96,"Mallocs":8042,"Frees":6076}
> {"Size":112,"Mallocs":12250,"Frees":10569}
> {"Size":128,"Mallocs":23168,"Frees":16902}
> {"Size":144,"Mallocs":3305,"Frees":1499}
> {"Size":160,"Mallocs":29309,"Frees":28151}
> {"Size":176,"Mallocs":799,"Frees":325}
> {"Size":192,"Mallocs":1486,"Frees":1043}
> {"Size":208,"Mallocs":2884,"Frees":1687}
> {"Size":224,"Mallocs":1196,"Frees":530}
> {"Size":240,"Mallocs":823,"Frees":671}
> {"Size":256,"Mallocs":4766,"Frees":4442}
> {"Size":288,"Mallocs":31701,"Frees":30325}
> {"Size":320,"Mallocs":2082,"Frees":1431}
> {"Size":352,"Mallocs":2775,"Frees":2370}
> {"Size":384,"Mallocs":387,"Frees":250}
> {"Size":416,"Mallocs":429,"Frees":284}
> {"Size":448,"Mallocs":229,"Frees":116}
> {"Size":480,"Mallocs":151,"Frees":133}
> {"Size":512,"Mallocs":3240,"Frees":3053}
> {"Size":576,"Mallocs":9709,"Frees":5829}
> {"Size":640,"Mallocs":192,"Frees":137}
> {"Size":704,"Mallocs":203,"Frees":131}
> {"Size":768,"Mallocs":739,"Frees":427}
> {"Size":896,"Mallocs":829,"Frees":401}
> {"Size":1024,"Mallocs":2817,"Frees":2677}
{"Size":1152,"Mallocs":983,"Frees":812} > {"Size":1280,"Mallocs":467,"Frees":109} > {"Size":1408,"Mallocs":923,"Frees":335} > {"Size":1536,"Mallocs":429,"Frees":289} > {"Size":1792,"Mallocs":644,"Frees":547} > {"Size":2048,"Mallocs":390,"Frees":210} > {"Size":2304,"Mallocs":249,"Frees":144} > {"Size":2688,"Mallocs":190,"Frees":106} > {"Size":3072,"Mallocs":20,"Frees":16} > {"Size":3200,"Mallocs":23,"Frees":13} > {"Size":3456,"Mallocs":14,"Frees":11} > {"Size":4096,"Mallocs":201,"Frees":127} > {"Size":4864,"Mallocs":52,"Frees":42} > {"Size":5376,"Mallocs":29,"Frees":8} > {"Size":6144,"Mallocs":12,"Frees":7} > {"Size":6528,"Mallocs":11,"Frees":3} > {"Size":6784,"Mallocs":3,"Frees":1} > {"Size":6912,"Mallocs":0,"Frees":0} > {"Size":8192,"Mallocs":50,"Frees":32} > {"Size":9472,"Mallocs":46,"Frees":24} > {"Size":9728,"Mallocs":0,"Frees":0} > {"Size":10240,"Mallocs":0,"Frees":0} > {"Size":10880,"Mallocs":2,"Frees":1} > {"Size":12288,"Mallocs":3,"Frees":0} > {"Size":13568,"Mallocs":0,"Frees":0} > {"Size":14336,"Mallocs":7,"Frees":3} > {"Size":16384,"Mallocs":30,"Frees":17} > {"Size":18432,"Mallocs":3,"Frees":3} > {"Size":19072,"Mallocs":16,"Frees":12} > ]} > > > Pause durations in milliseconds and end times: > 0 2018-12-07 00:02:40.164525406 +0000 UTC > 3 2018-12-07 00:02:40.182034741 +0000 UTC > 3 2018-12-07 00:02:40.209848541 +0000 UTC > 0 2018-12-07 00:02:40.278986107 +0000 UTC > 0 2018-12-07 00:02:40.733914972 +0000 UTC > 22 2018-12-07 00:02:41.198224758 +0000 UTC > 0 2018-12-07 00:02:41.290430652 +0000 UTC > 0 2018-12-07 00:02:41.374230055 +0000 UTC > 0 2018-12-07 00:02:41.455867822 +0000 UTC > 2 2018-12-07 00:02:41.561961827 +0000 UTC > 0 2018-12-07 00:04:41.577752402 +0000 UTC > 0 2018-12-07 00:06:41.598823342 +0000 UTC > 0 2018-12-07 00:08:41.62234336 +0000 UTC > 0 2018-12-07 00:10:41.643055123 +0000 UTC > 0 2018-12-07 00:12:41.665242451 +0000 UTC > 5794 2018-12-07 00:13:19.296615758 +0000 UTC > 5265 2018-12-07 00:14:01.227906209 +0000 UTC > 28632 2018-12-07 00:14:59.097608371 +0000 UTC > > > > > > -- > You received this message because you are subscribed to the Google Groups > "golang-nuts" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to golang-nuts+unsubscr...@googlegroups.com > <mailto:golang-nuts+unsubscr...@googlegroups.com>. > For more options, visit https://groups.google.com/d/optout > <https://groups.google.com/d/optout>. -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.