Yeah, I've been looking at the goroutine profiles and there are some strange stacks like the one below.
1 reflect.flag.mustBeExported /go/src/reflect/value.go:213
  reflect.Value.call /go/src/reflect/value.go:424
  reflect.Value.Call /go/src/reflect/value.go:308
  net/http/httptrace.(*ClientTrace).compose.func1 /go/src/net/http/httptrace/trace.go:203
  reflect.callReflect /go/src/reflect/value.go:536
  reflect.makeFuncStub /go/src/reflect/asm_amd64.s:20
  reflect.Value.call /go/src/reflect/value.go:447
  reflect.Value.Call /go/src/reflect/value.go:308
  net/http/httptrace.(*ClientTrace).compose.func1 /go/src/net/http/httptrace/trace.go:204
  reflect.callReflect /go/src/reflect/value.go:536
  reflect.makeFuncStub /go/src/reflect/asm_amd64.s:20
  reflect.Value.call /go/src/reflect/value.go:447
  reflect.Value.Call /go/src/reflect/value.go:308
  net/http/httptrace.(*ClientTrace).compose.func1 /go/src/net/http/httptrace/trace.go:204
  reflect.callReflect /go/src/reflect/value.go:536
  reflect.makeFuncStub /go/src/reflect/asm_amd64.s:20
  reflect.Value.call /go/src/reflect/value.go:447
  reflect.Value.Call /go/src/reflect/value.go:308
  net/http/httptrace.(*ClientTrace).compose.func1 /go/src/net/http/httptrace/trace.go:204
  reflect.callReflect /go/src/reflect/value.go:536
  reflect.makeFuncStub /go/src/reflect/asm_amd64.s:20
  reflect.Value.call /go/src/reflect/value.go:447
  reflect.Value.Call /go/src/reflect/value.go:308
  net/http/httptrace.(*ClientTrace).compose.func1 /go/src/net/http/httptrace/trace.go:204
  reflect.callReflect /go/src/reflect/value.go:536
  reflect.makeFuncStub /go/src/reflect/asm_amd64.s:20
  reflect.Value.call /go/src/reflect/value.go:447
  reflect.Value.Call /go/src/reflect/value.go:308
  net/http/httptrace.(*ClientTrace).compose.func1 /go/src/net/http/httptrace/trace.go:204
  reflect.callReflect /go/src/reflect/value.go:536
  reflect.makeFuncStub /go/src/reflect/asm_amd64.s:20
  reflect.Value.call /go/src/reflect/value.go:447

The reflect.MakeFunc is invoked by the httptrace package. From what I could gather, when you call httptrace.WithClientTrace(context.Context, *httptrace.ClientTrace), the package will use reflection to compose any existing httptrace.ClientTrace objects on the context with the trace you provide. I only provide 1 trace, so it's strange there are 6 traces being composed. The code for adding the trace is pretty straightforward:

// This is called exactly once on service startup to insert a hook before every request to DynamoDB (from the Go AWS SDK)
func instrument(dynamo *dynamodb.DynamoDB) {
    dynamo.Client.Handlers.Send.PushFront(metricHandler())
}

func metricHandler() func(r *request.Request) {
    // There should be exactly 1 instance of this trace object, since the metricHandler() function is invoked once
    tracer := &httptrace.ClientTrace{
        ConnectStart: func(_, _ string) {
            // Incrementing a prometheus counter
            newConns.Inc()
        },
    }
    return func(r *request.Request) {
        // This should get applied once per request
        r.SetContext(httptrace.WithClientTrace(r.Context(), tracer))
    }
}

I've done a little digging along this vector, but there isn't much to look at. A value from httptrace is applied to a hook in the AWS Go SDK.

On Tue, Jul 2, 2019 at 3:46 PM andrey mirtchovski <mirtchov...@gmail.com> wrote:
> ok, this is interesting:
>
> reflect.MakeFunc: i've never done this before. what are the allocation patterns for creating functions with reflect? i see a few crashes related to these functions but no mentioning of severe memory consumption.
>
> in my opinion, trying to capture MakeFunc patterns from your original program in a small example will be helpful.
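Andrey's suggestion to capture the MakeFunc pattern in a small example seems like the right next step. Below is a minimal sketch (mine, purely illustrative, not code from the service) of how the composition can nest: every WithClientTrace call wraps whatever trace is already stored on the context using reflect.MakeFunc, so applying it six times down the same context chain produces the six nested compose.func1 frames seen in the stack above. In the service there is only one trace and it should only be applied once per request, so the open question is whether anything re-applies the trace to a context that already carries one; the loop here just stands in for that hypothetical repetition.

package main

import (
    "context"
    "fmt"
    "net/http/httptrace"
)

func main() {
    ctx := context.Background()

    // Apply a ClientTrace six times to the same context chain. Each call
    // finds the trace already stored on the context and composes the new
    // one with it via reflection (reflect.MakeFunc), one level per call.
    for i := 0; i < 6; i++ {
        trace := &httptrace.ClientTrace{
            ConnectStart: func(_, _ string) { fmt.Println("connect start") },
        }
        ctx = httptrace.WithClientTrace(ctx, trace)
    }

    // Firing the hook now runs all six callbacks through nested
    // reflect.Value.Call / reflect.makeFuncStub / compose.func1 frames,
    // the same repeating pattern as the goroutine stack above.
    httptrace.ContextClientTrace(ctx).ConnectStart("tcp", "203.0.113.1:443")
}

If something like that is happening, httptrace.ContextClientTrace(r.Context()) could be used to check for an existing trace before calling WithClientTrace again, but that's a guess at a mitigation, not a confirmed fix.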
>
> On Tue, Jul 2, 2019 at 1:28 PM Bakul Shah <ba...@bitblocks.com> wrote:
> >
> > Switching Go version seems like a stab in the dark. If the OOM symptom does show up, you have simply wasted time. If it doesn't show up, you still don't know if the bug exists and is simply hiding. Even if you think the bug is in Go code generation (or GC) and not in your code, there is nothing the Go developers can do without a concrete test. So then you have to keep changing things until the symptom disappears for a long time and cross fingers the bug went away....
> >
> > If the symptom disappears when you take out httptrace, now you have narrowed the hypothesis to httptrace but you haven't proven it. You still don't know if removing httptrace changed the timing behavior such that the bug is hiding somewhere and if it will show up in future.
> >
> > This is why one should change as little as possible outside of what is needed for testing a specific hypothesis and why you should have a test that tickles the bug. By "proving" I mean you must find a specific assertion that is violated.
> >
> > I'd instrument the code under question and somehow capture the history of the last N seconds just before OOM. Capturing a heap profile is just one way to look at what your code does. You can find other ways to look at what your code does and create a log. For instance, the extra memory use is also a symptom just like OOM; what you may want to check is whether your data structures are consistent. The actual inconsistency may have occurred long before the OOM crash.
> >
> > Note that the GC logs allow you to see some aspect of the GC behavior better, but if you don't understand it well enough, it may seem mysterious (compared to your mental model) and later, when you run out of other hypotheses, it may even seem suspicious, so I'd be careful about looking at such things :-)
> >
> > On Jul 2, 2019, at 11:16 AM, 'Yunchi Luo' via golang-nuts <golang-nuts@googlegroups.com> wrote:
> >
> > I removed the httptrace call yesterday and there have been no OOMs yet. Going to let it bake for another day. If OOMs show up again, I'll try reverting to an older Go version tomorrow. Otherwise I'll point my finger at httptrace I guess.
> >
> > On Tue, Jul 2, 2019 at 2:15 PM Yunchi Luo <yunchi...@squareup.com> wrote:
> >>
> >> I did try to do that! I have 3 heap profiles captured from the ~3 seconds before crash. The only thing particularly suspicious is the httptrace call I mentioned earlier in the thread.
> >>
> >> Diffing 1 to 2
> >> (pprof) cum
> >> (pprof) top 50
> >> Showing nodes accounting for 4604.15kB, 81.69% of 5636.17kB total
> >>       flat  flat%   sum%        cum   cum%
> >>          0     0%     0%  5120.16kB 90.84%  net/http/httptrace.(*ClientTrace).compose.func1
> >>          0     0%     0%  5120.16kB 90.84%  reflect.Value.Call
> >>          0     0%     0%  5120.16kB 90.84%  reflect.Value.call
> >>          0     0%     0%  5120.16kB 90.84%  reflect.callReflect
> >>  5120.16kB 90.84% 90.84%  5120.16kB 90.84%  reflect.funcLayout.func1
> >>          0     0% 90.84%  5120.16kB 90.84%  reflect.makeFuncStub
> >>          0     0% 90.84%  4604.15kB 81.69%  sync.(*Pool).Get
> >>          0     0% 90.84%  -516.01kB  9.16%  io.Copy
> >>          0     0% 90.84%  -516.01kB  9.16%  io.copyBuffer
> >>          0     0% 90.84%  -516.01kB  9.16%  io/ioutil.devNull.ReadFrom
> >>  -516.01kB  9.16% 81.69%  -516.01kB  9.16%  io/ioutil.glob..func1
> >>
> >> Diff 2 to 3
> >> (pprof) top 50
> >> Showing nodes accounting for 7680.44kB, 100% of 7680.44kB total
> >>       flat  flat%   sum%        cum   cum%
> >>          0     0%     0%  6144.18kB 80.00%  net/http/httptrace.(*ClientTrace).compose.func1
> >>          0     0%     0%  6144.18kB 80.00%  reflect.Value.Call
> >>          0     0%     0%  6144.18kB 80.00%  reflect.Value.call
> >>   512.01kB  6.67%  6.67%  6144.18kB 80.00%  reflect.callReflect
> >>          0     0%  6.67%  6144.18kB 80.00%  reflect.makeFuncStub
> >>  5632.17kB 73.33% 80.00%  5632.17kB 73.33%  reflect.funcLayout.func1
> >>          0     0% 80.00%  5632.17kB 73.33%  sync.(*Pool).Get
> >>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/request.(*HandlerList).Run
> >>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/request.(*Request).Send
> >>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/request.(*Request).Sign
> >>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).build
> >>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.SignSDKRequest
> >>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.SignSDKRequestWithCurrentTime
> >>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.Signer.signWithBody
> >>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/service/dynamodb.(*DynamoDB).GetItemWithContext
> >>
> >> On Tue, Jul 2, 2019 at 2:08 PM andrey mirtchovski <mirtchov...@gmail.com> wrote:
> >>>
> >>> What I have found useful in the past is pprof's ability to diff profiles. That means that if you capture heap profiles at regular intervals you can see a much smaller subset of changes and compare allocation patterns.
> >>>
> >>> On Tue, Jul 2, 2019, 10:53 AM 'Yunchi Luo' via golang-nuts <golang-nuts@googlegroups.com> wrote:
> >>>>
> >>>> I'm not so much pointing my finger at GC as I am hoping GC logs could help tell the story, and that someone with a strong understanding of GC in Go could weigh in here. In the last 4 seconds before OOM, "TotalAlloc" increased by only 80M, yet "HeapIdle" increased to 240M from 50M, and RSS increased by 810M. The numbers don't add up for me. A running sum of 80M of heap objects was allocated in the time RSS increased by 10X that. If GC was completely off, I still wouldn't expect this to happen, which makes me want to rule out GC being blocked as a problem. Maybe there was runaway heap reservation because something in my code caused a ton of fragmentation? Is that sane? The heap profile lacking clues is also strange.
> >>>>
> >>>> Regarding the possibility of a race, I forgot I do have goroutine profiles captured along with the heap profiles at the time memory exploded. There are only 10 goroutines running on the serving path, which rules out too many concurrent requests being served (please correct me if I'm wrong). Those fan out to 13 goroutines talking to the db, all of which are in http.Transport.roundTrip (which is blocked on runtime.gopark, so I assume they are waiting on the TCP connection). All other goroutines that don't originate in the stdlib are also blocked on `select` or sleeping. Our CI does run with go test -race, but I'll try doing some load testing with a race-detector-enabled binary.
> >>>>
> >>>> Bakul, that is sound advice. I've actually been debugging this on and off for a couple of months now, with the help of several people, a few of whom have peer-reviewed the code. I agree it's most likely to be some runaway code that I caused in my logic, but we haven't been able to pinpoint the cause and I've run out of hypotheses to test at the moment. This is why I've started asking on go-nuts@. The circuit breaker code was one of the first things I checked; it has been unit tested and verified working with load tests. Now that you mention it, I actually did uncover a Go stdlib bug in http2 while doing the load tests... but that's unrelated.
> >>>>
> >>>> On Tue, Jul 2, 2019 at 2:24 AM Bakul Shah <ba...@bitblocks.com> wrote:
> >>>>>
> >>>>> Before assuming it is the GC or something system related, you may wish to verify it is *not your own logic*. Larger RSS could also be due to your own logic touching more and more memory due to some runaway effect. The probability this has to do with GC is very low given the very widespread use of Go, and the probability of a bug in new code is very high given it is used on a much, much smaller scale.
> >>>>>
> >>>>> This has the "smell" of a concurrency bug. If I were you I'd test the code for any races, I'd review the code thoroughly with someone who doesn't know the code so that I'm forced to explain it, and I'd add plenty of assertions. I'd probably first look at the circuit breaker code -- things like how does it know how many concurrent connections exist?
> >>>>>
> >>>>> In general, for any hypothesis you come up with, you should have a test that *catches* the bug given the hypothesis. Elusive bugs tend to become more elusive as you are on the hunt and as you may fix other problems you discover on the way.
> >>>>>
> >>>>> I even suspect you're looking at GC logs a bit too early. Instrument your own code and look at what patterns emerge. [Not to mention any time you spend on understanding your code will help improve your service; but better understanding of and debugging the GC won't necessarily help you!]
> >>>>>
> >>>>> On Jul 1, 2019, at 12:14 PM, 'Yunchi Luo' via golang-nuts <golang-nuts@googlegroups.com> wrote:
> >>>>>
> >>>>> Hello, I'd like to solicit some help with a weird GC issue we are seeing.
> >>>>>
> >>>>> I'm trying to debug OOMs on a service we are running in k8s. The service is just a CRUD server hitting a database (DynamoDB). Each replica serves about 300 qps of traffic. There are no memory leaks. On occasion (seemingly correlated to small latency spikes on the backend), the service would OOM. This is surprising because it has a circuit breaker that drops requests after 200 concurrent connections, which never trips, and goroutine profiles confirm that there are nowhere near 200 active goroutines.
> >>>>>
> >>>>> GC logs are pasted below.
> >>>>> They're interlaced with dumps of runtime.MemStats (the RSS number is coming from /proc/<pid>/stats). Go version is 1.12.5, running in an Alpine 3.10 container on an Amazon kernel 4.14.123-111.109.amzn2.x86_64.
> >>>>>
> >>>>> The service happily serves requests using ~50MB of RSS for hours, until the last 2 seconds, where GC mark&sweep time starts growing 2-4X per cycle (43+489/158/0.60+0.021 ms cpu => 43+489/158/0.60+0.021 ms cpu), and RSS and Sys blow up. It's also interesting that the last log line shows `Sys=995MB RSS=861MB HeapSys=199MB`. If I'm reading this correctly, there's at least `662MB` of memory in RSS that is not assigned to the heap. Though this might be due to the change in 1.12 to use MADV_FREE, so the pages are freeable but not yet reclaimed by the kernel.
> >>>>>
> >>>>> I don't understand how the heap can be so small across gc cycles (28->42->30MB on the last line means the heap doesn't grow past 42MB?), yet RSS keeps growing. I'm assuming the increased RSS is causing the kernel to OOM the service, but that should only happen if the RSS is not freeable as marked by MADV_FREE. There doesn't seem to be any indication of that from the GC logs. I guess this all comes down to me not having a good understanding of how the GC algorithm works and how to read these logs. I'd really appreciate it if anyone can explain what's happening and why.
> >>>>>
> >>>>> gc 41833 @19135.227s 0%: 0.019+2.3+0.005 ms clock, 0.079+0.29/2.2/5.6+0.020 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> >>>>> INFO 2019-06-30T08:46:04.886 [Memory]: Alloc=7MB TotalAlloc=230172MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
> >>>>> gc 41834 @19135.869s 0%: 0.005+2.9+0.003 ms clock, 0.023+0.32/2.5/6.6+0.012 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> >>>>> INFO 2019-06-30T08:46:05.886 [Memory]: Alloc=9MB TotalAlloc=230179MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
> >>>>> gc 41835 @19136.704s 0%: 0.038+2.1+0.004 ms clock, 0.15+0.35/2.1/5.3+0.016 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> >>>>> INFO 2019-06-30T08:46:06.886 [Memory]: Alloc=9MB TotalAlloc=230184MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
> >>>>> gc 41836 @19137.611s 0%: 0.009+2.1+0.003 ms clock, 0.036+0.39/2.0/5.7+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> >>>>> INFO 2019-06-30T08:46:07.887 [Memory]: Alloc=10MB TotalAlloc=230190MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
> >>>>> gc 41837 @19138.444s 0%: 0.008+2.1+0.004 ms clock, 0.035+0.51/2.1/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> >>>>> INFO 2019-06-30T08:46:08.887 [Memory]: Alloc=10MB TotalAlloc=230195MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
> >>>>> gc 41838 @19139.474s 0%: 0.005+2.6+0.003 ms clock, 0.023+0.37/2.5/4.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> >>>>> gc 41839 @19140.173s 0%: 0.011+2.4+0.003 ms clock, 0.046+0.20/2.3/5.8+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> >>>>> INFO 2019-06-30T08:46:09.887 [Memory]: Alloc=7MB TotalAlloc=230202MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
> >>>>> gc 41840 @19140.831s 0%: 0.082+2.1+0.003 ms clock, 0.32+0.64/2.1/5.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> >>>>> INFO 2019-06-30T08:46:10.887 [Memory]: Alloc=9MB TotalAlloc=230209MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
> >>>>> gc 41841 @19141.655s 0%: 0.014+2.1+0.003 ms clock, 0.056+0.28/2.0/5.7+0.013 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> >>>>> gc 41842 @19142.316s 0%: 0.006+2.7+0.003 ms clock, 0.027+0.29/2.6/6.2+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> >>>>> INFO 2019-06-30T08:46:11.888 [Memory]: Alloc=6MB TotalAlloc=230216MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
> >>>>> gc 41843 @19142.942s 0%: 0.010+2.1+0.005 ms clock, 0.040+0.29/2.0/5.7+0.023 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> >>>>> INFO 2019-06-30T08:46:12.888 [Memory]: Alloc=9MB TotalAlloc=230223MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
> >>>>> gc 41844 @19143.724s 0%: 0.008+2.4+0.004 ms clock, 0.035+0.38/2.0/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> >>>>> gc 41845 @19144.380s 0%: 10+9.3+0.044 ms clock, 43+6.1/9.2/4.4+0.17 ms cpu, 11->11->6 MB, 12 MB goal, 4 P
> >>>>> INFO 2019-06-30T08:46:13.901 [Memory]: Alloc=6MB TotalAlloc=230230MB Sys=136MB RSS=98MB HeapSys=94MB HeapIdle=83MB HeapInUse=11MB HeapReleased=35MB
> >>>>> gc 41846 @19144.447s 0%: 0.008+26+0.005 ms clock, 0.033+0.46/7.8/26+0.020 ms cpu, 11->12->9 MB, 12 MB goal, 4 P
> >>>>> gc 41847 @19144.672s 0%: 0.013+76+0.006 ms clock, 0.053+0.20/6.4/80+0.024 ms cpu, 17->18->8 MB, 18 MB goal, 4 P
> >>>>> gc 41848 @19145.014s 0%: 0.008+172+0.005 ms clock, 0.035+0.13/8.5/177+0.022 ms cpu, 15->17->10 MB, 16 MB goal, 4 P
> >>>>> gc 41849 @19145.298s 0%: 0.007+285+0.006 ms clock, 0.030+10/285/7.6+0.024 ms cpu, 19->23->15 MB, 20 MB goal, 4 P
> >>>>> INFO 2019-06-30T08:46:15.052 [Memory]: Alloc=22MB TotalAlloc=230264MB Sys=598MB RSS=531MB HeapSys=265MB HeapIdle=240MB HeapInUse=25MB HeapReleased=164MB
> >>>>> gc 41850 @19145.665s 0%: 10+419+0.005 ms clock, 43+489/158/0.60+0.021 ms cpu, 26->30->17 MB, 30 MB goal, 4 P
> >>>>> gc 41851 @19146.325s 0%: 21+798+0.036 ms clock, 86+990/401/0+0.14 ms cpu, 28->42->30 MB, 34 MB goal, 4 P
> >>>>> INFO 2019-06-30T08:46:16.613 [Memory]: Alloc=41MB TotalAlloc=230303MB Sys=995MB RSS=861MB HeapSys=199MB HeapIdle=155MB HeapInUse=44MB HeapReleased=54MB
> >>>>>
> >>>>> I also captured the OOM log from dmesg here: https://gist.github.com/mightyguava/7ecc6fc55f5cd925062d6beede3783b3.
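Both pieces of advice above (Andrey's diffing of heap profiles captured at regular intervals, and Bakul's capturing the history of the last N seconds before the OOM) point to the same kind of instrumentation, so here is a minimal sketch of it. The watchMemory name, the one-second interval, and the file naming are assumptions of mine, not how the service actually does it; the printed fields are the runtime.MemStats fields from the INFO [Memory] lines above, minus RSS, which the service reads from /proc separately.

package main

import (
    "fmt"
    "os"
    "runtime"
    "runtime/pprof"
    "time"
)

const mb = 1 << 20

// watchMemory logs MemStats once per interval and keeps a small rolling
// window of heap profiles on disk, so the last few snapshots before an
// OOM kill are still available for diffing afterwards.
func watchMemory(every time.Duration) {
    var m runtime.MemStats
    for i := 0; ; i++ {
        time.Sleep(every)

        runtime.ReadMemStats(&m)
        fmt.Printf("[Memory]: Alloc=%dMB TotalAlloc=%dMB Sys=%dMB HeapSys=%dMB HeapIdle=%dMB HeapInUse=%dMB HeapReleased=%dMB\n",
            m.Alloc/mb, m.TotalAlloc/mb, m.Sys/mb, m.HeapSys/mb, m.HeapIdle/mb, m.HeapInuse/mb, m.HeapReleased/mb)

        // Overwrite the oldest snapshot; ten files of history is plenty
        // to cover the ~3 seconds before the crash discussed above.
        f, err := os.Create(fmt.Sprintf("heap-%04d.pprof", i%10))
        if err != nil {
            continue
        }
        pprof.WriteHeapProfile(f)
        f.Close()
    }
}

func main() {
    go watchMemory(time.Second)
    select {} // stand-in for the real server
}

Two snapshots can then be compared with, for example, go tool pprof -base heap-0001.pprof heap-0003.pprof, which produces output shaped like the "Diffing 1 to 2" listing earlier in the thread.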
--
Yunchi