Following that logic, a leak of TCP connections should manifest as a file
descriptor leak. We have the process_open_fds metric from Prometheus, which
is the number of open file descriptors as found in /proc/<pid>/fd. The
descriptor count over time correlates well with the amount of traffic and is
fairly cyclic. There doesn't appear to be a leak.
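
For what it's worth, the metric is easy to sanity-check from inside the
process with something like this (a minimal sketch, assuming Linux with
/proc mounted):

    package main

    import (
        "fmt"
        "io/ioutil"
    )

    func main() {
        // process_open_fds is essentially len(/proc/self/fd). Note that
        // the ReadDir itself briefly holds one extra descriptor open.
        fds, err := ioutil.ReadDir("/proc/self/fd")
        if err != nil {
            panic(err)
        }
        fmt.Println("open fds:", len(fds))
    }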

We don't do our own memory management and the binary is compiled with
CGO_ENABLED=0.

I still think the issue I'm seeing should be GC (or heap) related, given
the explosion in mark & sweep time, HeapSys, HeapIdle, and HeapReleased
just before the process dies. But I'm lacking ideas on how to track down
the cause of the increase.

On Mon, Jul 1, 2019 at 5:39 PM Robert Engels <reng...@ix.netcom.com> wrote:

> I don't think you are going to find it in the 'heap'; rather, it would be
> in native memory.
>
> I would monitor /proc/[pid] for the process, and pay attention to 'fd',
> 'net' and 'statm' - if my theory is correct you will see growth there long
> before the process is killed. Since you are running under k8s and cgroups,
> you will need to do this alongside the Go process (unless you have root
> access to the server).
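>
> Something like the sketch below, run from a sidecar (untested; the pid,
> interval, and log destination are placeholders):
>
>  package main
>
>  import (
>      "bytes"
>      "fmt"
>      "io/ioutil"
>      "log"
>      "time"
>  )
>
>  func main() {
>      pid := 1 // placeholder: the Go process's pid
>      base := fmt.Sprintf("/proc/%d", pid)
>      for range time.Tick(10 * time.Second) {
>          fds, _ := ioutil.ReadDir(base + "/fd")       // open descriptors
>          statm, _ := ioutil.ReadFile(base + "/statm") // pages: size resident shared ...
>          sock, _ := ioutil.ReadFile(base + "/net/sockstat") // socket counts
>          log.Printf("fds=%d statm=%s sockstat=%s",
>              len(fds), bytes.TrimSpace(statm), bytes.TrimSpace(sock))
>      }
>  }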
>
> I 'think', depending on kernel version, that kernel memory used is charged
> against the process for OOM purposes, so this is a likely candidate if
> pprof is showing nothing.
>
> Do you by chance do any of your own memory management (via malloc/CGO)? If
> so, this is not going to show in pprof either.
>
> -----Original Message-----
> From: 'Yunchi Luo' via golang-nuts
> Sent: Jul 1, 2019 4:26 PM
> To: Robert Engels
> Cc: golang-nuts@googlegroups.com, Alec Thomas
> Subject: Re: [go-nuts] OOM occurring with a small heap
>
> I actually have a heap profile (pasted at the bottom) from about 1 second
> before the service died (the goroutine that logs "[Memory]" triggers a
> heap profile once RSS > 100MB). I don't see TCP connections there. Maybe
> there are too few to be sampled. How would I verify your theory? The fact
> that the service dies within 2 seconds, after running fine for several
> hours, makes it very hard to debug.
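>
> For reference, the trigger is roughly this shape (a simplified sketch,
> not the exact code; readRSS is a hypothetical helper that parses
> /proc/self/statm):
>
>  go func() {
>      for range time.Tick(time.Second) {
>          if readRSS() > 100<<20 { // 100MB
>              f, _ := os.Create("/tmp/heap.pprof") // placeholder path
>              pprof.WriteHeapProfile(f)            // runtime/pprof
>              f.Close()
>          }
>      }
>  }()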
>
> The top thing in the heap profile is from the reflect package. I initially
> found that suspect, but it turns out this comes from a use of
> httptrace.ClientTrace I had for counting new connections to DynamoDB.
>
>  tracer := &httptrace.ClientTrace{
>    ConnectStart: func(_, _ string) {
>      newConns.Inc()
>    },
>  }
>
> `newConns` is just a Prometheus counter. The `tracer` object itself is
> created once and re-used with every client request context. On each
> request, `httptrace.WithClientTrace(ctx, tracer)` uses reflection to
> compose the trace functions under the hood and to invoke them, hence the
> reflect.funcLayout and reflect.Value.call. Objects in `reflect` account
> for about 50% of the heap by size, and do seem to grow as the service runs
> out of memory, but that's only 12MB, so I thought it was a red herring.
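>
> For context, the per-request wiring is just this (simplified):
>
>  ctx := httptrace.WithClientTrace(req.Context(), tracer)
>  resp, err := client.Do(req.WithContext(ctx))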
>
> Heap profile:
> Type: inuse_space
> Time: Jun 30, 2019 at 4:46am (EDT)
> Entering interactive mode (type "help" for commands, "o" for options)
> (pprof) inuse_objects
> (pprof) top
> Showing nodes accounting for 414485, 100% of 414485 total
> Showing top 10 nodes out of 81
>       flat  flat%   sum%        cum   cum%
>     344074 83.01% 83.01%     344074 83.01%  reflect.funcLayout.func1
>      32768  7.91% 90.92%     376842 90.92%  reflect.callReflect
>      16384  3.95% 94.87%      16384  3.95%  github.com/json-iterator/go.processTags
>      10923  2.64% 97.51%      10923  2.64%  context.WithValue
>       8192  1.98% 99.48%       8192  1.98%  crypto/hmac.New
>       1260   0.3% 99.79%       1260   0.3%  github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).buildCanonicalHeaders
>        820   0.2%   100%        820   0.2%  github.com/stripe/veneur/tdigest.NewMerging
>         64 0.015%   100%         64 0.015%  reflect.addReflectOff
>          0     0%   100%        820   0.2%  git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindowDigest).Observe
>          0     0%   100%        820   0.2%  git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindowDigest).openDigests
> (pprof) cum
> (pprof) top
> Showing nodes accounting for 376842, 90.92% of 414485 total
> Showing top 10 nodes out of 81
>       flat  flat%   sum%        cum   cum%
>          0     0%     0%     376842 90.92%  net/http/httptrace.(*ClientTrace).compose.func1
>          0     0%     0%     376842 90.92%  reflect.Value.Call
>          0     0%     0%     376842 90.92%  reflect.Value.call
>      32768  7.91%  7.91%     376842 90.92%  reflect.callReflect
>          0     0%  7.91%     376842 90.92%  reflect.makeFuncStub
>     344074 83.01% 90.92%     344074 83.01%  reflect.funcLayout.func1
>          0     0% 90.92%     344074 83.01%  sync.(*Pool).Get
>          0     0% 90.92%      16448  3.97%  github.com/json-iterator/go._createDecoderOfType
>          0     0% 90.92%      16448  3.97%  github.com/json-iterator/go.createDecoderOfType
>          0     0% 90.92%      16448  3.97%  github.com/json-iterator/go.decoderOfStruct
>
>
> On Mon, Jul 1, 2019 at 4:32 PM Robert Engels <reng...@ix.netcom.com>
> wrote:
>
>>
>> A leak of TCP connections (maybe not properly closed)? Each TCP
>> connection will use kernel memory and process memory (local buffers) that
>> won't be on the heap (the reference to the TCP connection will be in the
>> Go heap, but it is probably much smaller than the buffer allocation).
>>
>> That would be my guess - but just a guess.
>>
>> -----Original Message-----
>> From: 'Yunchi Luo' via golang-nuts
>> Sent: Jul 1, 2019 2:14 PM
>> To: golang-nuts@googlegroups.com
>> Cc: Alec Thomas
>> Subject: [go-nuts] OOM occurring with a small heap
>>
>> Hello, I'd like to solicit some help with a weird GC issue we are seeing.
>>
>> I'm trying to debug an OOM on a service we are running in k8s. The
>> service is just a CRUD server hitting a database (DynamoDB). Each replica
>> serves about 300 qps of traffic. There are no memory leaks. On occasion
>> (seemingly correlated with small latency spikes on the backend), the
>> service OOMs. This is surprising because it has a circuit breaker that
>> drops requests after 200 concurrent connections, which has never tripped,
>> and goroutine profiles confirm that there are nowhere near 200 active
>> goroutines.
>>
>> GC logs are pasted below, interlaced with dumps of runtime.MemStats (the
>> RSS number comes from /proc/<pid>/stat). Go version is 1.12.5, running in
>> an Alpine 3.10 container on an Amazon 4.14.123-111.109.amzn2.x86_64
>> kernel.
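>>
>> The gc lines come from running the service with the GC trace enabled,
>> i.e.:
>>
>>  GODEBUG=gctrace=1 ./service   # binary name is a placeholder
>>
>> and the [Memory] lines are our own periodic dump of runtime.MemStats.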
>>
>> The service happily serves requests using ~50MB of RSS for hours, until
>> the last 2 seconds, where GC mark & sweep time starts to grow 2-4X per
>> cycle (43+489/158/0.60+0.021 ms cpu => 86+990/401/0+0.14 ms cpu), and RSS
>> and Sys blow up. It's also interesting that the last log line shows
>> `Sys=995MB RSS=861MB HeapSys=199MB`. If I'm reading this correctly,
>> there's at least 662MB of memory in RSS that is not assigned to the heap.
>> Though this might be due to the change in 1.12 to use MADV_FREE, so the
>> pages are freeable but not yet reclaimed by the kernel.
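>>
>> If that's the case, one way to rule MADV_FREE in or out (assuming Go
>> 1.12+) would be to redeploy with madvise(MADV_DONTNEED) restored, so
>> freed pages leave RSS immediately instead of lazily:
>>
>>  GODEBUG=madvdontneed=1 ./service   # binary name is a placeholder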
>>
>> I don't understand how the heap can stay so small across GC cycles
>> (28->42->30 MB on the last line means the heap doesn't grow past 42MB?),
>> yet RSS keeps growing. I'm assuming the increased RSS is causing the
>> kernel to OOM-kill the service, but that should only happen if the RSS is
>> not freeable as marked by MADV_FREE. There doesn't seem to be any
>> indication of that in the GC logs. I guess this all comes down to me not
>> having a good understanding of how the GC algorithm works and how to read
>> these logs. I'd really appreciate it if anyone can explain what's
>> happening and why.
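>>
>> For reference, my reading of one of the bad lines below, going by the
>> gctrace format in the runtime package docs (gc 41850 as the example):
>>
>>  gc 41850 @19145.665s 0%: 10+419+0.005 ms clock, 43+489/158/0.60+0.021 ms cpu, 26->30->17 MB, 30 MB goal, 4 P
>>  // clock: 10ms sweep termination (STW) + 419ms concurrent mark + 0.005ms mark termination (STW)
>>  // cpu: mark time splits assist/background/idle, so 489ms was allocation assists
>>  // 26->30->17 MB: heap at GC start -> heap at mark end -> live heap after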
>>
>> gc 41833 @19135.227s 0%: 0.019+2.3+0.005 ms clock,
>> 0.079+0.29/2.2/5.6+0.020 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:04.886 [Memory]: Alloc=7MB TotalAlloc=230172MB
>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
>> gc 41834 @19135.869s 0%: 0.005+2.9+0.003 ms clock,
>> 0.023+0.32/2.5/6.6+0.012 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:05.886 [Memory]: Alloc=9MB TotalAlloc=230179MB
>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
>> gc 41835 @19136.704s 0%: 0.038+2.1+0.004 ms clock,
>> 0.15+0.35/2.1/5.3+0.016 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:06.886 [Memory]: Alloc=9MB TotalAlloc=230184MB
>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
>> gc 41836 @19137.611s 0%: 0.009+2.1+0.003 ms clock,
>> 0.036+0.39/2.0/5.7+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:07.887 [Memory]: Alloc=10MB TotalAlloc=230190MB
>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
>> gc 41837 @19138.444s 0%: 0.008+2.1+0.004 ms clock,
>> 0.035+0.51/2.1/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:08.887 [Memory]: Alloc=10MB TotalAlloc=230195MB
>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
>> gc 41838 @19139.474s 0%: 0.005+2.6+0.003 ms clock,
>> 0.023+0.37/2.5/4.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> gc 41839 @19140.173s 0%: 0.011+2.4+0.003 ms clock,
>> 0.046+0.20/2.3/5.8+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:09.887 [Memory]: Alloc=7MB TotalAlloc=230202MB
>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
>> gc 41840 @19140.831s 0%: 0.082+2.1+0.003 ms clock,
>> 0.32+0.64/2.1/5.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:10.887 [Memory]: Alloc=9MB TotalAlloc=230209MB
>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
>> gc 41841 @19141.655s 0%: 0.014+2.1+0.003 ms clock,
>> 0.056+0.28/2.0/5.7+0.013 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> gc 41842 @19142.316s 0%: 0.006+2.7+0.003 ms clock,
>> 0.027+0.29/2.6/6.2+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:11.888 [Memory]: Alloc=6MB TotalAlloc=230216MB
>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
>> gc 41843 @19142.942s 0%: 0.010+2.1+0.005 ms clock,
>> 0.040+0.29/2.0/5.7+0.023 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:12.888 [Memory]: Alloc=9MB TotalAlloc=230223MB
>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
>> gc 41844 @19143.724s 0%: 0.008+2.4+0.004 ms clock,
>> 0.035+0.38/2.0/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> gc 41845 @19144.380s 0%: 10+9.3+0.044 ms clock, 43+6.1/9.2/4.4+0.17 ms
>> cpu, 11->11->6 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:13.901 [Memory]: Alloc=6MB TotalAlloc=230230MB
>> Sys=136MB RSS=98MB HeapSys=94MB HeapIdle=83MB HeapInUse=11MB
>> HeapReleased=35MB
>> gc 41846 @19144.447s 0%: 0.008+26+0.005 ms clock, 0.033+0.46/7.8/26+0.020
>> ms cpu, 11->12->9 MB, 12 MB goal, 4 P
>> gc 41847 @19144.672s 0%: 0.013+76+0.006 ms clock, 0.053+0.20/6.4/80+0.024
>> ms cpu, 17->18->8 MB, 18 MB goal, 4 P
>> gc 41848 @19145.014s 0%: 0.008+172+0.005 ms clock,
>> 0.035+0.13/8.5/177+0.022 ms cpu, 15->17->10 MB, 16 MB goal, 4 P
>> gc 41849 @19145.298s 0%: 0.007+285+0.006 ms clock, 0.030+10/285/7.6+0.024
>> ms cpu, 19->23->15 MB, 20 MB goal, 4 P
>> INFO 2019-06-30T08:46:15.052 [Memory]: Alloc=22MB TotalAlloc=230264MB
>> Sys=598MB RSS=531MB HeapSys=265MB HeapIdle=240MB HeapInUse=25MB
>> HeapReleased=164MB
>> gc 41850 @19145.665s 0%: 10+419+0.005 ms clock, 43+489/158/0.60+0.021 ms
>> cpu, 26->30->17 MB, 30 MB goal, 4 P
>> gc 41851 @19146.325s 0%: 21+798+0.036 ms clock, 86+990/401/0+0.14 ms cpu,
>> 28->42->30 MB, 34 MB goal, 4 P
>> INFO 2019-06-30T08:46:16.613 [Memory]: Alloc=41MB TotalAlloc=230303MB
>> Sys=995MB RSS=861MB HeapSys=199MB HeapIdle=155MB HeapInUse=44MB
>> HeapReleased=54MB
>>
>> I also captured the OOM log from dmesg here
>> https://gist.github.com/mightyguava/7ecc6fc55f5cd925062d6beede3783b3.
>>
>> --
>> Yunchi
>>
>
> --
> Yunchi
>

-- 
Yunchi
