ok, this is interesting:

reflect.MakeFunc: I've never used it before. What are the allocation
patterns for creating functions with reflect? I see a few crash reports
related to these functions, but no mention of severe memory
consumption.

In my opinion, trying to capture the MakeFunc pattern from your original
program in a small, standalone example would be helpful; something along
the lines of the sketch below.
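
For what it's worth, here is a minimal sketch of the composition pattern
involved. This is not the stdlib's actual code, just an illustration of the
reflect.MakeFunc call path that shows up in your profile:

package main

import (
	"fmt"
	"reflect"
)

// compose returns a function with the same signature as f and g that calls
// both, roughly in the spirit of httptrace.(*ClientTrace).compose. Every
// invocation of the returned function goes through reflect.callReflect,
// which builds an argument frame per call; that appears to be where the
// reflect.funcLayout / sync.(*Pool).Get entries in your heap profile come from.
func compose(f, g interface{}) interface{} {
	fv, gv := reflect.ValueOf(f), reflect.ValueOf(g)
	out := reflect.MakeFunc(fv.Type(), func(args []reflect.Value) []reflect.Value {
		fv.Call(args)
		return gv.Call(args)
	})
	return out.Interface()
}

func main() {
	hook := compose(
		func(addr string) { fmt.Println("first hook:", addr) },
		func(addr string) { fmt.Println("second hook:", addr) },
	).(func(string))

	// Calling the composed hook on a hot path allocates on every call.
	for i := 0; i < 3; i++ {
		hook("10.0.0.1:443")
	}
}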

On Tue, Jul 2, 2019 at 1:28 PM Bakul Shah <ba...@bitblocks.com> wrote:
>
> Switching Go versions seems like a stab in the dark. If the OOM symptom still
> shows up, you have simply wasted time. If it doesn't show up, you still don't
> know whether the bug exists and is simply hiding. Even if you think the bug is
> in Go code generation (or the GC) and not in your code, there is nothing the Go
> developers can do without a concrete test. So then you have to keep changing
> things until the symptom disappears for a long time and cross your fingers that
> the bug went away....
>
> If the symptom disappears when you take out httptrace, you have narrowed the
> hypothesis to httptrace, but you haven't proven it. You still don't know
> whether removing httptrace merely changed the timing so that the bug is hiding
> somewhere else, or whether it will show up again in the future.
>
> This is why one should change as little as possible outside of what is needed 
> for testing a specific hypothesis and why you should have a test that tickles 
> the bug. By "proving" I mean you must find a specific assertion that is 
> violated.
>
> I'd instrument the code in question and somehow capture the history of the
> last N seconds just before the OOM. Capturing a heap profile is just one way to
> look at what your code does. You can find other ways to look at what your
> code does and create a log. For instance, the extra memory use is also a
> symptom, just like the OOM; what you may want to check is whether your data
> structures are consistent. The actual inconsistency may have occurred long
> before the OOM crash.
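>
> For concreteness, here is a rough sketch of the kind of instrumentation I
> mean. The window size, the trigger condition, and the fields logged are all
> placeholders; assert whatever invariants you expect your own program to hold:
>
> // memwatch keeps the last N seconds of runtime.MemStats samples in a
> // ring buffer and dumps the history when memory use jumps, so you can
> // see what led up to the blow-up rather than only its aftermath.
> package memwatch
>
> import (
>     "log"
>     "runtime"
>     "time"
> )
>
> const window = 30 // keep roughly the last 30 seconds of samples
>
> func Watch() {
>     history := make([]runtime.MemStats, 0, window)
>     for range time.Tick(time.Second) {
>         var ms runtime.MemStats
>         runtime.ReadMemStats(&ms)
>         if len(history) == window {
>             history = history[1:]
>         }
>         history = append(history, ms)
>
>         // Arbitrary trigger: Sys doubled within the window.
>         if ms.Sys > 2*history[0].Sys {
>             for _, h := range history {
>                 log.Printf("Sys=%dMB HeapSys=%dMB HeapIdle=%dMB HeapInuse=%dMB HeapReleased=%dMB",
>                     h.Sys>>20, h.HeapSys>>20, h.HeapIdle>>20, h.HeapInuse>>20, h.HeapReleased>>20)
>             }
>         }
>     }
> }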
>
> Note that the GC logs allow you to see some aspect of the GC behavior better 
> but if you don't understand it well enough, it may seem mysterious (compared 
> to your mental model) and later, when you run out of other hypotheses, it may 
> even seem suspicious, so I'd be careful about looking at such things :-)
>
> On Jul 2, 2019, at 11:16 AM, 'Yunchi Luo' via golang-nuts 
> <golang-nuts@googlegroups.com> wrote:
>
> I removed the httptrace call yesterday and there have been no OOMs yet. Going 
> to let it bake for another day. If OOMs show up again, I'll try reverting to 
> an older Go version tomorrow. Otherwise I'll point my finger at httptrace I 
> guess.
>
> On Tue, Jul 2, 2019 at 2:15 PM Yunchi Luo <yunchi...@squareup.com> wrote:
>>
>> I did try to do that! I have 3 heap profiles captured from the ~3 seconds
>> before the crash. The only thing particularly suspicious is the httptrace call
>> I mentioned earlier in the thread.
>>
>> Diffing 1 to 2
>> (pprof) cum
>> (pprof) top 50
>> Showing nodes accounting for 4604.15kB, 81.69% of 5636.17kB total
>>       flat  flat%   sum%        cum   cum%
>>          0     0%     0%  5120.16kB 90.84%  net/http/httptrace.(*ClientTrace).compose.func1
>>          0     0%     0%  5120.16kB 90.84%  reflect.Value.Call
>>          0     0%     0%  5120.16kB 90.84%  reflect.Value.call
>>          0     0%     0%  5120.16kB 90.84%  reflect.callReflect
>>  5120.16kB 90.84% 90.84%  5120.16kB 90.84%  reflect.funcLayout.func1
>>          0     0% 90.84%  5120.16kB 90.84%  reflect.makeFuncStub
>>          0     0% 90.84%  4604.15kB 81.69%  sync.(*Pool).Get
>>          0     0% 90.84%  -516.01kB  9.16%  io.Copy
>>          0     0% 90.84%  -516.01kB  9.16%  io.copyBuffer
>>          0     0% 90.84%  -516.01kB  9.16%  io/ioutil.devNull.ReadFrom
>>  -516.01kB  9.16% 81.69%  -516.01kB  9.16%  io/ioutil.glob..func1
>>
>> Diff 2 to 3
>> (pprof) top 50
>> Showing nodes accounting for 7680.44kB, 100% of 7680.44kB total
>>       flat  flat%   sum%        cum   cum%
>>          0     0%     0%  6144.18kB 80.00%  net/http/httptrace.(*ClientTrace).compose.func1
>>          0     0%     0%  6144.18kB 80.00%  reflect.Value.Call
>>          0     0%     0%  6144.18kB 80.00%  reflect.Value.call
>>   512.01kB  6.67%  6.67%  6144.18kB 80.00%  reflect.callReflect
>>          0     0%  6.67%  6144.18kB 80.00%  reflect.makeFuncStub
>>  5632.17kB 73.33% 80.00%  5632.17kB 73.33%  reflect.funcLayout.func1
>>          0     0% 80.00%  5632.17kB 73.33%  sync.(*Pool).Get
>>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/request.(*HandlerList).Run
>>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/request.(*Request).Send
>>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/request.(*Request).Sign
>>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).build
>>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.SignSDKRequest
>>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.SignSDKRequestWithCurrentTime
>>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.Signer.signWithBody
>>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/service/dynamodb.(*DynamoDB).GetItemWithContext
>>
>> On Tue, Jul 2, 2019 at 2:08 PM andrey mirtchovski <mirtchov...@gmail.com> 
>> wrote:
>>>
>>> What I have found useful in the past is pprof's ability to diff profiles.
>>> If you capture heap profiles at regular intervals, you can diff consecutive
>>> snapshots to see a much smaller subset of changes and compare allocation
>>> patterns.
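>>>
>>> If it helps, here is a minimal sketch of the snapshotting side (the file
>>> names, directory, and interval are placeholders):
>>>
>>> // profsnap writes a heap profile every few seconds so that consecutive
>>> // snapshots can be diffed later.
>>> package profsnap
>>>
>>> import (
>>>     "fmt"
>>>     "os"
>>>     "runtime"
>>>     "runtime/pprof"
>>>     "time"
>>> )
>>>
>>> func Snapshot(dir string) {
>>>     for i := 0; ; i++ {
>>>         f, err := os.Create(fmt.Sprintf("%s/heap-%03d.pprof", dir, i))
>>>         if err == nil {
>>>             runtime.GC() // force a GC so the profile reflects current live data
>>>             pprof.WriteHeapProfile(f)
>>>             f.Close()
>>>         }
>>>         time.Sleep(5 * time.Second)
>>>     }
>>> }
>>>
>>> Two snapshots can then be compared with something like
>>> "go tool pprof -base heap-001.pprof heap-002.pprof".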
>>>
>>> On Tue, Jul 2, 2019, 10:53 AM 'Yunchi Luo' via golang-nuts 
>>> <golang-nuts@googlegroups.com> wrote:
>>>>
>>>> I'm not so much pointing my finger at GC as I am hoping the GC logs can help
>>>> tell the story, and that someone with a strong understanding of GC in Go
>>>> could weigh in here. In the last 4 seconds before the OOM, "TotalAlloc"
>>>> increased by only 80M, yet "HeapIdle" increased from 50M to 240M and RSS
>>>> increased by 810M. The numbers don't add up for me: a running total of 80M
>>>> of heap objects was allocated in the time it took RSS to grow by 10X that.
>>>> Even if GC were completely off, I wouldn't expect this to happen, which makes
>>>> me want to rule out GC being blocked as the problem. Maybe there was runaway
>>>> heap reservation because something in my code caused a ton of
>>>> fragmentation? Is that sane? The heap profile lacking clues is also
>>>> strange.
>>>>
>>>> Regarding the possibility of a race, I forgot I do have goroutine profiles 
>>>> captured along with the heap profiles at the time memory exploded. There 
>>>> are only 10 goroutines running on the serving path, which rules out too 
>>>> many concurrent requests being served (please correct me if I'm wrong). 
>>>> Those fan out to 13 goroutines talking to the db, all of which are in 
>>>> http.Transport.roundTrip (which is blocked on runtime.gopark so I assume 
>>>> they are waiting on the TCP connection). All other goroutines that don't 
>>>> originate in the stdlib are also blocked on `select` or sleeping. Our CI 
>>>> does run with go test -race, but I'll try doing some load testing with a 
>>>> race-detector-enabled binary.
>>>>
>>>> Bakul, that is sound advice. I've actually been debugging this on and off
>>>> for a couple of months now, with the help of several people, a few of whom
>>>> have peer-reviewed the code. I agree it's most likely some runaway code in
>>>> my own logic, but we haven't been able to pinpoint the cause and I've run
>>>> out of hypotheses to test at the moment. This is why I've started asking on
>>>> go-nuts@. The circuit breaker code was one of the first things I checked;
>>>> it has been unit tested and verified working under load tests. Now that you
>>>> mention it, I actually did uncover a Go stdlib bug in http2 while doing the
>>>> load tests... but that's unrelated.
>>>>
>>>>
>>>> On Tue, Jul 2, 2019 at 2:24 AM Bakul Shah <ba...@bitblocks.com> wrote:
>>>>>
>>>>> Before assuming it is the GC or something system-related, you may wish to
>>>>> verify it is *not your own logic*. Larger RSS could also be due to your
>>>>> own logic touching more and more memory because of some runaway effect. The
>>>>> probability that this has to do with the GC is very low, given the very
>>>>> widespread use of Go, while the probability of a bug in new code is very
>>>>> high, given that it is exercised on a much, much smaller scale.
>>>>>
>>>>> This has the "smell" of a concurrency bug. If I were you I'd test the 
>>>>> code for any races, I'd review the code thoroughly with someone who 
>>>>> doesn't know the code so that I'm forced to explain it, and I'd add 
>>>>> plenty of assertions. I'd probably first look at the circuit breaker code 
>>>>> -- things like how does it know how many concurrent connections exist?
>>>>>
>>>>> In general, for any hypothesis you come up with, you should have a test that
>>>>> *catches* the bug if the hypothesis is true. Elusive bugs tend to become more
>>>>> elusive while you are on the hunt, as you fix other problems you discover
>>>>> along the way.
>>>>>
>>>>> I even suspect you're looking at GC logs a bit too early. Instrument your 
>>>>> own code and look at what patterns emerge. [Not to mention any time you 
>>>>> spend on understanding your code will help improve your service; but 
>>>>> better understanding of and debugging the GC won't necessarily help you!]
>>>>>
>>>>> On Jul 1, 2019, at 12:14 PM, 'Yunchi Luo' via golang-nuts 
>>>>> <golang-nuts@googlegroups.com> wrote:
>>>>>
>>>>> Hello, I'd like to solicit some help with a weird GC issue we are seeing.
>>>>>
>>>>> I'm trying to debug an OOM on a service we are running in k8s. The service
>>>>> is just a CRUD server hitting a database (DynamoDB). Each replica serves
>>>>> about 300 qps of traffic. There are no memory leaks. On occasion
>>>>> (seemingly correlated with small latency spikes on the backend), the
>>>>> service would OOM. This is surprising because it has a circuit breaker
>>>>> that drops requests beyond 200 concurrent connections and has never
>>>>> tripped, and goroutine profiles confirm that there are nowhere near 200
>>>>> active goroutines.
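>>>>>
>>>>> (For illustration, a hypothetical sketch of the kind of limit I mean; this
>>>>> is not our actual breaker code:)
>>>>>
>>>>> package limit
>>>>>
>>>>> import "net/http"
>>>>>
>>>>> // Limit wraps an http.Handler with a buffered channel used as a
>>>>> // semaphore, rejecting requests once max are already in flight.
>>>>> func Limit(next http.Handler, max int) http.Handler {
>>>>>     sem := make(chan struct{}, max) // max = 200 in our case
>>>>>     return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
>>>>>         select {
>>>>>         case sem <- struct{}{}:
>>>>>             defer func() { <-sem }()
>>>>>             next.ServeHTTP(w, r)
>>>>>         default:
>>>>>             http.Error(w, "too many concurrent requests", http.StatusServiceUnavailable)
>>>>>         }
>>>>>     })
>>>>> }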
>>>>>
>>>>> GC logs are pasted below, interlaced with dumps of runtime.MemStats
>>>>> (the RSS number comes from /proc/<pid>/stats). The Go version is 1.12.5,
>>>>> running in an Alpine 3.10 container on an Amazon kernel
>>>>> (4.14.123-111.109.amzn2.x86_64).
>>>>>
>>>>> The service happily serves requests using ~50MB of RSS for hours, until
>>>>> the last 2 seconds, where GC mark & sweep time starts growing 2-4X per cycle
>>>>> (43+489/158/0.60+0.021 ms cpu => 86+990/401/0+0.14 ms cpu over the last two
>>>>> cycles), and RSS and Sys blow up. It's also interesting that the last log
>>>>> line reads `Sys=995MB RSS=861MB HeapSys=199MB`. If I'm reading this
>>>>> correctly, there's at least `662MB` of memory in RSS that is not assigned
>>>>> to the heap. Though this might be due to the change in Go 1.12 to use
>>>>> MADV_FREE, so the pages are freeable but not yet reclaimed by the kernel.
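>>>>>
>>>>> (If that's the case, one way to test it, assuming I have the knob right, is
>>>>> to run the service with GODEBUG=madvdontneed=1, which was added in Go 1.12
>>>>> and switches the runtime back to MADV_DONTNEED. The kernel then reclaims
>>>>> freed pages immediately, so RSS should track real usage; if RSS still blows
>>>>> up, the memory is genuinely held and not just unreclaimed lazy-freed pages.)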
>>>>>
>>>>> I don't understand how the heap can stay so small across GC cycles
>>>>> (28->42->30 MB on the last line means the heap doesn't grow past 42MB,
>>>>> right?), yet RSS keeps growing. I'm assuming the increased RSS is causing
>>>>> the kernel to OOM the service, but that should only happen if the RSS is
>>>>> not freeable as marked by MADV_FREE. There doesn't seem to be any
>>>>> indication of that in the GC logs. I guess this all comes down to me not
>>>>> having a good understanding of how the GC algorithm works and how to read
>>>>> these logs. I'd really appreciate it if anyone could explain what's
>>>>> happening and why.
>>>>>
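>>>>> (For reference while reading these, the gctrace line format, paraphrased
>>>>> from the runtime package docs, is
>>>>>
>>>>>     gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # P
>>>>>
>>>>> where the clock and cpu triples cover sweep termination (STW), concurrent
>>>>> mark and scan, and mark termination (STW); the #/#/# inside the cpu part
>>>>> splits mark into assist, background, and idle time; and #->#-># MB is the
>>>>> heap size at GC start, at GC end, and the live heap. So the last gc line
>>>>> below reads: 86 ms of CPU in sweep termination, 990 ms of mark assist,
>>>>> 401 ms of background mark, and a heap that went from 28 MB at the start of
>>>>> the cycle to 42 MB at the end, with 30 MB live against a 34 MB goal.)
>>>>>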
>>>>> gc 41833 @19135.227s 0%: 0.019+2.3+0.005 ms clock, 
>>>>> 0.079+0.29/2.2/5.6+0.020 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>>>> INFO 2019-06-30T08:46:04.886 [Memory]: Alloc=7MB TotalAlloc=230172MB 
>>>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB 
>>>>> HeapReleased=5MB
>>>>> gc 41834 @19135.869s 0%: 0.005+2.9+0.003 ms clock, 
>>>>> 0.023+0.32/2.5/6.6+0.012 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>>>> INFO 2019-06-30T08:46:05.886 [Memory]: Alloc=9MB TotalAlloc=230179MB 
>>>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB 
>>>>> HeapReleased=5MB
>>>>> gc 41835 @19136.704s 0%: 0.038+2.1+0.004 ms clock, 
>>>>> 0.15+0.35/2.1/5.3+0.016 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>>>> INFO 2019-06-30T08:46:06.886 [Memory]: Alloc=9MB TotalAlloc=230184MB 
>>>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB 
>>>>> HeapReleased=5MB
>>>>> gc 41836 @19137.611s 0%: 0.009+2.1+0.003 ms clock, 
>>>>> 0.036+0.39/2.0/5.7+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>>>> INFO 2019-06-30T08:46:07.887 [Memory]: Alloc=10MB TotalAlloc=230190MB 
>>>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB 
>>>>> HeapReleased=5MB
>>>>> gc 41837 @19138.444s 0%: 0.008+2.1+0.004 ms clock, 
>>>>> 0.035+0.51/2.1/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>>>> INFO 2019-06-30T08:46:08.887 [Memory]: Alloc=10MB TotalAlloc=230195MB 
>>>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB 
>>>>> HeapReleased=5MB
>>>>> gc 41838 @19139.474s 0%: 0.005+2.6+0.003 ms clock, 
>>>>> 0.023+0.37/2.5/4.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>>>> gc 41839 @19140.173s 0%: 0.011+2.4+0.003 ms clock, 
>>>>> 0.046+0.20/2.3/5.8+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>>>> INFO 2019-06-30T08:46:09.887 [Memory]: Alloc=7MB TotalAlloc=230202MB 
>>>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB 
>>>>> HeapReleased=5MB
>>>>> gc 41840 @19140.831s 0%: 0.082+2.1+0.003 ms clock, 
>>>>> 0.32+0.64/2.1/5.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>>>> INFO 2019-06-30T08:46:10.887 [Memory]: Alloc=9MB TotalAlloc=230209MB 
>>>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB 
>>>>> HeapReleased=5MB
>>>>> gc 41841 @19141.655s 0%: 0.014+2.1+0.003 ms clock, 
>>>>> 0.056+0.28/2.0/5.7+0.013 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>>>> gc 41842 @19142.316s 0%: 0.006+2.7+0.003 ms clock, 
>>>>> 0.027+0.29/2.6/6.2+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>>>> INFO 2019-06-30T08:46:11.888 [Memory]: Alloc=6MB TotalAlloc=230216MB 
>>>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB 
>>>>> HeapReleased=5MB
>>>>> gc 41843 @19142.942s 0%: 0.010+2.1+0.005 ms clock, 
>>>>> 0.040+0.29/2.0/5.7+0.023 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>>>> INFO 2019-06-30T08:46:12.888 [Memory]: Alloc=9MB TotalAlloc=230223MB 
>>>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB 
>>>>> HeapReleased=5MB
>>>>> gc 41844 @19143.724s 0%: 0.008+2.4+0.004 ms clock, 
>>>>> 0.035+0.38/2.0/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>>>> gc 41845 @19144.380s 0%: 10+9.3+0.044 ms clock, 43+6.1/9.2/4.4+0.17 ms 
>>>>> cpu, 11->11->6 MB, 12 MB goal, 4 P
>>>>> INFO 2019-06-30T08:46:13.901 [Memory]: Alloc=6MB TotalAlloc=230230MB 
>>>>> Sys=136MB RSS=98MB HeapSys=94MB HeapIdle=83MB HeapInUse=11MB 
>>>>> HeapReleased=35MB
>>>>> gc 41846 @19144.447s 0%: 0.008+26+0.005 ms clock, 0.033+0.46/7.8/26+0.020 
>>>>> ms cpu, 11->12->9 MB, 12 MB goal, 4 P
>>>>> gc 41847 @19144.672s 0%: 0.013+76+0.006 ms clock, 0.053+0.20/6.4/80+0.024 
>>>>> ms cpu, 17->18->8 MB, 18 MB goal, 4 P
>>>>> gc 41848 @19145.014s 0%: 0.008+172+0.005 ms clock, 
>>>>> 0.035+0.13/8.5/177+0.022 ms cpu, 15->17->10 MB, 16 MB goal, 4 P
>>>>> gc 41849 @19145.298s 0%: 0.007+285+0.006 ms clock, 0.030+10/285/7.6+0.024 
>>>>> ms cpu, 19->23->15 MB, 20 MB goal, 4 P
>>>>> INFO 2019-06-30T08:46:15.052 [Memory]: Alloc=22MB TotalAlloc=230264MB 
>>>>> Sys=598MB RSS=531MB HeapSys=265MB HeapIdle=240MB HeapInUse=25MB 
>>>>> HeapReleased=164MB
>>>>> gc 41850 @19145.665s 0%: 10+419+0.005 ms clock, 43+489/158/0.60+0.021 ms 
>>>>> cpu, 26->30->17 MB, 30 MB goal, 4 P
>>>>> gc 41851 @19146.325s 0%: 21+798+0.036 ms clock, 86+990/401/0+0.14 ms cpu, 
>>>>> 28->42->30 MB, 34 MB goal, 4 P
>>>>> INFO 2019-06-30T08:46:16.613 [Memory]: Alloc=41MB TotalAlloc=230303MB 
>>>>> Sys=995MB RSS=861MB HeapSys=199MB HeapIdle=155MB HeapInUse=44MB 
>>>>> HeapReleased=54MB
>>>>>
>>>>> I also captured the OOM log from dmesg here 
>>>>> https://gist.github.com/mightyguava/7ecc6fc55f5cd925062d6beede3783b3.
>>>>>
>>>>> --
>>>>> Yunchi
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Yunchi
>>>>
>>
>>
>>
>> --
>> Yunchi
>
>
>
> --
> Yunchi
>
>
>

