I think it would be more helpful if you used gctrace=1 to report on the GC activity.
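To enable that alongside the schedtrace settings you already have, something like this (assuming you can set the environment on the container's entrypoint; the binary name is just a placeholder):

    GODEBUG=gctrace=1,schedtrace=1000,scheddetail=1 ./your-app

gctrace=1 prints one line to stderr per completed GC, with the wall-clock and CPU times broken out for the sweep-termination, concurrent-mark, and mark-termination phases, so it should show directly which phase is absorbing the 5 to 28 seconds you're seeing.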
> On Dec 11, 2018, at 3:37 PM, e...@kasten.io wrote:
>
> I am observing pause failures in an application and it appears to me that I'm
> seeing it take from 5 to 28 seconds or more for the StopTheWorld to take
> effect for GC.
>
> In all likelihood the problem lies in the application, but before I start
> changing it or attempting to tune GC, I'd like to understand what's
> happening. I'd appreciate any assistance in decoding what I'm seeing in
> schedtrace output.
>
> Details are below but here are 4 consecutive samples at 1 second intervals.
> For brevity I've pruned out lines with 'M.* p=-1' and 'G.* status=[34]'
> (syscall or waiting in runtime).
>
> The first two are after we start the part of the workload that triggers the
> failure and before the expected health check arrives. The second two are as
> we try to StopTheWorld, which holds off the health check.
>
> SCHED 674246ms: gomaxprocs=4 idleprocs=4 threads=14 spinningthreads=0 idlethreads=4 runqueue=0 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
>   P0: status=0 schedtick=7112 syscalltick=1916 m=-1 runqsize=0 gfreecnt=6
>   P1: status=0 schedtick=3545 syscalltick=1452 m=-1 runqsize=0 gfreecnt=8
>   P2: status=0 schedtick=4549 syscalltick=1465 m=-1 runqsize=0 gfreecnt=7
>   P3: status=0 schedtick=1616 syscalltick=1223 m=-1 runqsize=0 gfreecnt=0
>   M7: p=1 curg=301 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
>
> ^^^ I suspect G301/M7 was in transition when sampled without locks or in a
> state I don't understand. For reference, it reported:
>   G301: status=3(IO wait) m=7 lockedm=-1
>
> SCHED 675256ms: gomaxprocs=4 idleprocs=2 threads=14 spinningthreads=0 idlethreads=3 runqueue=0 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
>   P0: status=0 schedtick=7118 syscalltick=1921 m=-1 runqsize=0 gfreecnt=6
>   P1: status=1 schedtick=3548 syscalltick=2656 m=7 runqsize=0 gfreecnt=8
>   P2: status=1 schedtick=4551 syscalltick=1465 m=2 runqsize=0 gfreecnt=7
>   P3: status=0 schedtick=1620 syscalltick=1225 m=-1 runqsize=0 gfreecnt=0
>   M7: p=1 curg=301 mallocing=1 throwing=0 preemptoff= locks=2 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
>   M2: p=2 curg=7 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
>   G7: status=2(GC worker (idle)) m=2 lockedm=-1
>   G301: status=2(IO wait) m=7 lockedm=-1
>
> ^^^ G301 is now running on M7/P1. M7 is marked as mallocing=1, locks=2.
> ^^^ G7 is a GC worker that is now running on M2/P2. M2 has locks=1.
>
> >>> Health check log expected here
>
> SCHED 676264ms: gomaxprocs=4 idleprocs=1 threads=14 spinningthreads=0 idlethreads=2 runqueue=0 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
>   P0: status=1 schedtick=7120 syscalltick=1927 m=6 runqsize=0 gfreecnt=5
>   P1: status=1 schedtick=3548 syscalltick=2656 m=7 runqsize=0 gfreecnt=8
>   P2: status=1 schedtick=4551 syscalltick=1465 m=2 runqsize=0 gfreecnt=7
>   P3: status=0 schedtick=1622 syscalltick=1229 m=-1 runqsize=0 gfreecnt=0
>   M7: p=1 curg=301 mallocing=1 throwing=0 preemptoff= locks=2 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
>   M6: p=0 curg=188 mallocing=1 throwing=0 preemptoff= locks=2 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
>   M2: p=2 curg=7 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
>   G7: status=2(GC worker (idle)) m=2 lockedm=-1
>   G301: status=2(IO wait) m=7 lockedm=-1
>   G188: status=2() m=6 lockedm=-1
>
> ^^^ A total of 3 Ps in use, 1 GC worker, 2 mallocing.
>
> SCHED 677271ms: gomaxprocs=4 idleprocs=0 threads=14 spinningthreads=0 idlethreads=3 runqueue=1 gcwaiting=1 nmidlelocked=1 stopwait=1 sysmonwait=0
>   P0: status=1 schedtick=7120 syscalltick=1927 m=6 runqsize=0 gfreecnt=5
>   P1: status=3 schedtick=3550 syscalltick=2656 m=7 runqsize=0 gfreecnt=8
>   P2: status=3 schedtick=4552 syscalltick=1465 m=-1 runqsize=0 gfreecnt=7
>   P3: status=3 schedtick=1622 syscalltick=1229 m=-1 runqsize=0 gfreecnt=0
>   M7: p=1 curg=6 mallocing=0 throwing=0 preemptoff=gcing locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
>   M6: p=0 curg=188 mallocing=1 throwing=0 preemptoff= locks=2 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
>   G6: status=2(GC worker (idle)) m=7 lockedm=-1
>   G301: status=1(IO wait) m=-1 lockedm=-1
>   G188: status=2() m=6 lockedm=-1
>
> ^^^ Someone has called stopTheWorldWithSema, either directly or via
> stopTheWorld. It hasn't returned, because stopwait is > 0.
> ^^^ The call is likely not via stopTheWorld, and the caller is likely the G6
> GC worker running on M7/P1 (stopTheWorldWithSema marks its own P as gcstop).
> ^^^ stopTheWorldWithSema is called from gcStart and gcMarkDone, so one of
> those is the caller.
> ^^^ stopTheWorldWithSema is poll-waiting for preemptall() to take effect
> (for P0/M6/G188 to notice the preemption request).
>
> For the next 4 samples there are no changes in P* or M*.
>
> After that we see a health check handler logged, and in the sample after that
> gcwaiting and stopwait are 0.
>
> What can G188/M6/P0 be doing such that it doesn't notice the request to
> preempt itself for 5 seconds? And while having mallocing=1?
>
> Thanks for any insights or suggestions,
>
> Eric
>
>
> =============
>
> Details:
>
> go version go1.11.1 linux/amd64
> System: Containerized Linux (Alpine)
> GOMAXPROCS=4
>
> The container reports NumCPU=1, so GOMAXPROCS defaults to 1. The problem
> occurs with both GOMAXPROCS=1 and GOMAXPROCS=4.
>
> The application is a fairly basic Kubernetes application with an HTTP server
> that services a health check, a separate goroutine that periodically makes an
> HTTP client request to poll for work, and a pool of worker goroutines that
> pick up work from a channel. The system is configured to issue a health check
> every 10 seconds. Intermittently during testing, the health check fails
> repeatedly and the app is restarted by Kubernetes.
>
> The test load is not particularly high; often the workers are idle and at
> most 2 are active at a time.
> The pause failure can be seen with only 1 request in flight at a time.
>
> I added a log message to the health check handler, set
> GODEBUG=schedtrace=1000,scheddetail=1, and reproduced the failure. When the
> system is running normally, we see 1 health check log message every 10
> seconds, with 10 SCHED traces in between each.
>
> In looking at the failure, I observed one instance of the health check being
> serviced 5+ seconds late, after which the system recovered, and then a set of
> late health checks where it was over 28 seconds late and where Kubernetes
> terminated the job.
>
> The extract above is from the 5+ second pause. I'm happy to share the fuller
> logs, and I also have runtime.MemStats dumped after the second failure. It
> reports PauseTotalNs of 39.7 seconds, with the last 3 pauses being 5.7s,
> 5.2s, and 28.6s.
>
> ========================
>
> Even more details:
>
> Might as well include MemStats:
>
> memstats
> "Alloc":35820624
> "TotalAlloc":87362776
> "Sys":72284408
> "Lookups":0
> "Mallocs":913917
> "Frees":732071
> "HeapAlloc":35820624
> "HeapSys":66027520
> "HeapIdle":25460736
> "HeapInuse":40566784
> "HeapReleased":0
> "HeapObjects":181846
> "StackInuse":1048576
> "StackSys":1048576
> "MSpanInuse":361000
> "MSpanSys":409600
> "MCacheInuse":6912
> "MCacheSys":16384
> "BuckHashSys":1495863
> "GCSys":2441216
> "OtherSys":845249
> "NextGC":42461456
> "LastGC":1544141699097608371
> "PauseTotalNs":39724621006
> "PauseNs":[121384
> 3326024
> 3534690
> 525590
> 54645
> 22208743
> 98898
> 250651
> 37238
> 2034238
> 33731
> 38616
> 41163
> 35875
> 37798
> 5794188053
> 5265243818
> 28632809851]
> "PauseEnd":[1544140960164525406
> 1544140960182034741
> 1544140960209848541
> 1544140960278986107
> 1544140960733914972
> 1544140961198224758
> 1544140961290430652
> 1544140961374230055
> 1544140961455867822
> 1544140961561961827
> 1544141081577752402
> 1544141201598823342
> 1544141321622343360
> 1544141441643055123
> 1544141561665242451
> 1544141599296615758
> 1544141641227906209
> 1544141699097608371]
> "NumGC":18
> "NumForcedGC":0
> "GCCPUFraction":0.013482378367869897
> "EnableGC":true
> "DebugGC":false
> "BySize":[
> {"Size":0,"Mallocs":0,"Frees":0}
> {"Size":8,"Mallocs":19587,"Frees":16601}
> {"Size":16,"Mallocs":329030,"Frees":254713}
> {"Size":32,"Mallocs":169516,"Frees":119550}
> {"Size":48,"Mallocs":89240,"Frees":73834}
> {"Size":64,"Mallocs":27676,"Frees":20629}
> {"Size":80,"Mallocs":19143,"Frees":13042}
> {"Size":96,"Mallocs":8042,"Frees":6076}
> {"Size":112,"Mallocs":12250,"Frees":10569}
> {"Size":128,"Mallocs":23168,"Frees":16902}
> {"Size":144,"Mallocs":3305,"Frees":1499}
> {"Size":160,"Mallocs":29309,"Frees":28151}
> {"Size":176,"Mallocs":799,"Frees":325}
> {"Size":192,"Mallocs":1486,"Frees":1043}
> {"Size":208,"Mallocs":2884,"Frees":1687}
> {"Size":224,"Mallocs":1196,"Frees":530}
> {"Size":240,"Mallocs":823,"Frees":671}
> {"Size":256,"Mallocs":4766,"Frees":4442}
> {"Size":288,"Mallocs":31701,"Frees":30325}
> {"Size":320,"Mallocs":2082,"Frees":1431}
> {"Size":352,"Mallocs":2775,"Frees":2370}
> {"Size":384,"Mallocs":387,"Frees":250}
> {"Size":416,"Mallocs":429,"Frees":284}
> {"Size":448,"Mallocs":229,"Frees":116}
> {"Size":480,"Mallocs":151,"Frees":133}
> {"Size":512,"Mallocs":3240,"Frees":3053}
> {"Size":576,"Mallocs":9709,"Frees":5829}
> {"Size":640,"Mallocs":192,"Frees":137}
> {"Size":704,"Mallocs":203,"Frees":131}
> {"Size":768,"Mallocs":739,"Frees":427}
> {"Size":896,"Mallocs":829,"Frees":401}
> {"Size":1024,"Mallocs":2817,"Frees":2677}
{"Size":1152,"Mallocs":983,"Frees":812} > {"Size":1280,"Mallocs":467,"Frees":109} > {"Size":1408,"Mallocs":923,"Frees":335} > {"Size":1536,"Mallocs":429,"Frees":289} > {"Size":1792,"Mallocs":644,"Frees":547} > {"Size":2048,"Mallocs":390,"Frees":210} > {"Size":2304,"Mallocs":249,"Frees":144} > {"Size":2688,"Mallocs":190,"Frees":106} > {"Size":3072,"Mallocs":20,"Frees":16} > {"Size":3200,"Mallocs":23,"Frees":13} > {"Size":3456,"Mallocs":14,"Frees":11} > {"Size":4096,"Mallocs":201,"Frees":127} > {"Size":4864,"Mallocs":52,"Frees":42} > {"Size":5376,"Mallocs":29,"Frees":8} > {"Size":6144,"Mallocs":12,"Frees":7} > {"Size":6528,"Mallocs":11,"Frees":3} > {"Size":6784,"Mallocs":3,"Frees":1} > {"Size":6912,"Mallocs":0,"Frees":0} > {"Size":8192,"Mallocs":50,"Frees":32} > {"Size":9472,"Mallocs":46,"Frees":24} > {"Size":9728,"Mallocs":0,"Frees":0} > {"Size":10240,"Mallocs":0,"Frees":0} > {"Size":10880,"Mallocs":2,"Frees":1} > {"Size":12288,"Mallocs":3,"Frees":0} > {"Size":13568,"Mallocs":0,"Frees":0} > {"Size":14336,"Mallocs":7,"Frees":3} > {"Size":16384,"Mallocs":30,"Frees":17} > {"Size":18432,"Mallocs":3,"Frees":3} > {"Size":19072,"Mallocs":16,"Frees":12} > ]} > > > Pause durations in milliseconds and end times: > 0 2018-12-07 00:02:40.164525406 +0000 UTC > 3 2018-12-07 00:02:40.182034741 +0000 UTC > 3 2018-12-07 00:02:40.209848541 +0000 UTC > 0 2018-12-07 00:02:40.278986107 +0000 UTC > 0 2018-12-07 00:02:40.733914972 +0000 UTC > 22 2018-12-07 00:02:41.198224758 +0000 UTC > 0 2018-12-07 00:02:41.290430652 +0000 UTC > 0 2018-12-07 00:02:41.374230055 +0000 UTC > 0 2018-12-07 00:02:41.455867822 +0000 UTC > 2 2018-12-07 00:02:41.561961827 +0000 UTC > 0 2018-12-07 00:04:41.577752402 +0000 UTC > 0 2018-12-07 00:06:41.598823342 +0000 UTC > 0 2018-12-07 00:08:41.62234336 +0000 UTC > 0 2018-12-07 00:10:41.643055123 +0000 UTC > 0 2018-12-07 00:12:41.665242451 +0000 UTC > 5794 2018-12-07 00:13:19.296615758 +0000 UTC > 5265 2018-12-07 00:14:01.227906209 +0000 UTC > 28632 2018-12-07 00:14:59.097608371 +0000 UTC > > > > > > -- > You received this message because you are subscribed to the Google Groups > "golang-nuts" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to golang-nuts+unsubscr...@googlegroups.com > <mailto:golang-nuts+unsubscr...@googlegroups.com>. > For more options, visit https://groups.google.com/d/optout > <https://groups.google.com/d/optout>. -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.