I don't think GC is being blocked. GC ran multiple times while the service's memory was exploding.
I'm not using mmap in the service. It's just a straightforward CRUD webserver backed by DynamoDB. The only non-standard thing I can think of is that it connects to its backing database over HTTP/1.1 (using the AWS Go SDK) instead of a custom database protocol. It's serving a fair number of requests (500 qps per replica), so I initially assumed it was under-provisioned to handle latency spikes. But I tripled the memory and it's still OOMing on occasion, even though RSS is 9% of allocated memory just seconds before.

Yunchi

On Jul 2, 2019, 12:00 AM -0400, Robert Engels <reng...@ix.netcom.com>, wrote:
> Does your process use mmap? Maybe you are leaking there, as this counts against process memory size.
>
> On Jul 1, 2019, at 9:11 PM, Mighty Guava <mightygu...@gmail.com> wrote:
> > I don't understand. What would adding runtime.Gosched() do here? I don't have any explicit loops in this service.
> > On Jul 1, 2019, 9:11 PM -0400, Michael Jones <michael.jo...@gmail.com>, wrote:
> > > Does adding runtime.Gosched() calls make any difference?
> > >
> > > > On Mon, Jul 1, 2019 at 5:37 PM 'Yunchi Luo' via golang-nuts <golang-nuts@googlegroups.com> wrote:
> > > > > Following that logic, a leak of TCP connections should manifest as a file descriptor leak. We have the process_open_fds metric from Prometheus, which is the number of open file descriptors as found in /proc/<pid>/fd. The number of descriptors over time correlates well with the amount of traffic, pretty cyclic. There doesn't appear to be a leak.
> > > > >
> > > > > We don't do our own memory management, and the binary is compiled with CGO_ENABLED=0.
> > > > >
> > > > > I still think the issue I'm seeing should be GC (or heap) related, given the explosion in mark & sweep time, HeapSys, HeapIdle, and HeapReleased just before the process dies. But I'm lacking ideas on how to track down the cause of the increase.
> > > > >
> > > > > > On Mon, Jul 1, 2019 at 5:39 PM Robert Engels <reng...@ix.netcom.com> wrote:
> > > > > > > I don't think you are going to find it in the 'heap'; rather, it would be in native memory.
> > > > > > >
> > > > > > > I would monitor /proc/[pid] for the process, and pay attention to 'fd', 'net' and 'statm' - if my theory is correct you will see growth here long before the process is killed. Since you are running under k8s and cgroups, you will need to do this alongside the Go process (unless you have root access to the server).
> > > > > > >
> > > > > > > I 'think', depending on kernel version, that kernel memory used goes against the process for OOM purposes, so this is a likely candidate if pprof is showing nothing.
> > > > > > >
> > > > > > > Do you by chance do any of your own memory management (via malloc/CGO)? If so, this is not going to show in pprof either.
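For anyone wanting to do the side-by-side /proc monitoring described above, a minimal sketch in Go might look like the following (Linux-specific; the package name, poll interval, and use of /proc/self instead of a sidecar watching another pid are illustrative choices, not the actual setup in this thread):

package procwatch

import (
	"io/ioutil"
	"log"
	"os"
	"time"
)

// Watch periodically logs /proc/self/statm (memory usage, in pages) and the
// number of entries in /proc/self/fd (open descriptors). Point the paths at
// /proc/<pid>/... instead to watch another process from a sidecar.
func Watch(interval time.Duration) {
	for range time.Tick(interval) {
		statm, err := ioutil.ReadFile("/proc/self/statm") // "size resident shared text lib data dt"
		if err != nil {
			log.Printf("statm: %v", err)
			continue
		}
		fds, err := ioutil.ReadDir("/proc/self/fd")
		if err != nil {
			log.Printf("fd: %v", err)
			continue
		}
		log.Printf("statm=%q openFDs=%d pageSize=%d", string(statm), len(fds), os.Getpagesize())
	}
}

Running something like `go Watch(10 * time.Second)` inside the service avoids needing a privileged sidecar, at the cost of the monitor dying together with the process.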
> > > > > > > > -----Original Message-----
> > > > > > > > From: 'Yunchi Luo' via golang-nuts
> > > > > > > > Sent: Jul 1, 2019 4:26 PM
> > > > > > > > To: Robert Engels
> > > > > > > > Cc: golang-nuts@googlegroups.com, Alec Thomas
> > > > > > > > Subject: Re: [go-nuts] OOM occurring with a small heap
> > > > > > > >
> > > > > > > > I actually have a heap profile (pasted at the bottom) from about 1 second before the service died (the goroutine that is logging "[Memory]" triggers heap profiles once RSS > 100MB). I don't see TCP connections there. Maybe there are too few to be sampled. How would I verify your theory? That the service dies within 2 seconds after several hours makes it very hard to debug.
> > > > > > > >
> > > > > > > > The top thing in the heap profile is from the reflect package. I initially found that suspect, but it turns out this comes from a use of httptrace.ClientTrace I had for counting new connections to DynamoDB.
> > > > > > > >
> > > > > > > > tracer := &httptrace.ClientTrace{
> > > > > > > >     ConnectStart: func(_, _ string) {
> > > > > > > >         newConns.Inc()
> > > > > > > >     },
> > > > > > > > }
> > > > > > > >
> > > > > > > > `newConns` is just a Prometheus counter. The `tracer` object itself is created once and re-used with every client request context. On each request, `httptrace.WithClientTrace(ctx, tracer)` uses reflection to compose the trace functions under the hood and to invoke them, hence the reflect.funcLayout and reflect.Value.call frames. Objects in `reflect` account for about 50% of the heap by size, and do seem to grow as the service runs out of memory, but that's only 12MB, so I thought it was a red herring.
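For context on how a tracer like that gets wired into outgoing requests, a minimal sketch follows (the package, counter, and function names are illustrative stand-ins, not the actual service code):

package dbclient

import (
	"net/http"
	"net/http/httptrace"
	"sync/atomic"
)

var newConns int64 // stand-in for the Prometheus counter in the post

// tracer is created once and shared by every outgoing request.
var tracer = &httptrace.ClientTrace{
	ConnectStart: func(_, _ string) {
		atomic.AddInt64(&newConns, 1)
	},
}

// do attaches the trace to the request's context. WithClientTrace returns a
// derived context carrying the (composed) trace; the HTTP transport then
// invokes the hooks, via reflection, while it sets up connections.
func do(client *http.Client, req *http.Request) (*http.Response, error) {
	ctx := httptrace.WithClientTrace(req.Context(), tracer)
	return client.Do(req.WithContext(ctx))
}

That composition-and-invoke path is what shows up as the reflect.Value.Call / reflect.makeFuncStub frames in the profile below.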
> > > > > > > > Heap profile:
> > > > > > > > Type: inuse_space
> > > > > > > > Time: Jun 30, 2019 at 4:46am (EDT)
> > > > > > > > Entering interactive mode (type "help" for commands, "o" for options)
> > > > > > > > (pprof) inuse_objects
> > > > > > > > (pprof) top
> > > > > > > > Showing nodes accounting for 414485, 100% of 414485 total
> > > > > > > > Showing top 10 nodes out of 81
> > > > > > > >       flat  flat%   sum%        cum   cum%
> > > > > > > >     344074 83.01% 83.01%     344074 83.01%  reflect.funcLayout.func1
> > > > > > > >      32768  7.91% 90.92%     376842 90.92%  reflect.callReflect
> > > > > > > >      16384  3.95% 94.87%      16384  3.95%  github.com/json-iterator/go.processTags
> > > > > > > >      10923  2.64% 97.51%      10923  2.64%  context.WithValue
> > > > > > > >       8192  1.98% 99.48%       8192  1.98%  crypto/hmac.New
> > > > > > > >       1260   0.3% 99.79%       1260   0.3%  github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).buildCanonicalHeaders
> > > > > > > >        820   0.2%   100%        820   0.2%  github.com/stripe/veneur/tdigest.NewMerging
> > > > > > > >         64 0.015%   100%         64 0.015%  reflect.addReflectOff
> > > > > > > >          0     0%   100%        820   0.2%  git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindowDigest).Observe
> > > > > > > >          0     0%   100%        820   0.2%  git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindowDigest).openDigests
> > > > > > > > (pprof) cum
> > > > > > > > (pprof) top
> > > > > > > > Showing nodes accounting for 376842, 90.92% of 414485 total
> > > > > > > > Showing top 10 nodes out of 81
> > > > > > > >       flat  flat%   sum%        cum   cum%
> > > > > > > >          0     0%     0%     376842 90.92%  net/http/httptrace.(*ClientTrace).compose.func1
> > > > > > > >          0     0%     0%     376842 90.92%  reflect.Value.Call
> > > > > > > >          0     0%     0%     376842 90.92%  reflect.Value.call
> > > > > > > >      32768  7.91%  7.91%     376842 90.92%  reflect.callReflect
> > > > > > > >          0     0%  7.91%     376842 90.92%  reflect.makeFuncStub
> > > > > > > >     344074 83.01% 90.92%     344074 83.01%  reflect.funcLayout.func1
> > > > > > > >          0     0% 90.92%     344074 83.01%  sync.(*Pool).Get
> > > > > > > >          0     0% 90.92%      16448  3.97%  github.com/json-iterator/go._createDecoderOfType
> > > > > > > >          0     0% 90.92%      16448  3.97%  github.com/json-iterator/go.createDecoderOfType
> > > > > > > >          0     0% 90.92%      16448  3.97%  github.com/json-iterator/go.decoderOfStruct
> > > > > > > >
> > > > > > > > > On Mon, Jul 1, 2019 at 4:32 PM Robert Engels <reng...@ix.netcom.com> wrote:
> > > > > > > > > > A leak of the TCP connections (maybe not properly closed)? Each TCP connection will use kernel memory and process memory (local buffers), that won't be on the heap (the reference to the TCP connection will be in the Go heap, but is probably much smaller than the buffer allocation).
> > > > > > > > > >
> > > > > > > > > > That would be my guess - but just a guess.
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: 'Yunchi Luo' via golang-nuts
> > > > > > > > > > > Sent: Jul 1, 2019 2:14 PM
> > > > > > > > > > > To: golang-nuts@googlegroups.com
> > > > > > > > > > > Cc: Alec Thomas
> > > > > > > > > > > Subject: [go-nuts] OOM occurring with a small heap
> > > > > > > > > > >
> > > > > > > > > > > Hello, I'd like to solicit some help with a weird GC issue we are seeing.
> > > > > > > > > > >
> > > > > > > > > > > I'm trying to debug an OOM on a service we are running in k8s. The service is just a CRUD server hitting a database (DynamoDB). Each replica serves about 300 qps of traffic. There are no memory leaks. On occasion (seemingly correlated with small latency spikes on the backend), the service will OOM. This is surprising because it has a circuit breaker that drops requests after 200 concurrent connections, which never trips, and goroutine profiles confirm that there are nowhere near 200 active goroutines.
> > > > > > > > > > >
> > > > > > > > > > > GC logs are pasted below, interlaced with dumps of runtime.MemStats (the RSS number comes from /proc/<pid>/stat). Go version is 1.12.5, running in an Alpine 3.10 container on an Amazon kernel 4.14.123-111.109.amzn2.x86_64.
> > > > > > > > > > >
> > > > > > > > > > > The service happily serves requests using ~50MB of RSS for hours, until the last 2 seconds, where GC mark & sweep time starts growing 2-4X per cycle (e.g. 43+489/158/0.60+0.021 ms cpu => 86+990/401/0+0.14 ms cpu across the last two cycles), and RSS and Sys blow up. It's also interesting that the last log line shows `Sys=995MB RSS=861MB HeapSys=199MB`. If I'm reading this correctly, there's at least `662MB` of memory in RSS that is not assigned to the heap. Though this might be due to the change in 1.12 to use MADV_FREE, so the pages are freeable but not yet reclaimed by the kernel.
> > > > > > > > > > >
> > > > > > > > > > > I don't understand how the heap can stay so small across GC cycles (28->42->30 MB on the last line means the heap doesn't grow past 42MB?), yet RSS keeps growing. I'm assuming the increased RSS is causing the kernel to OOM-kill the service, but that should only happen if the RSS is not freeable as marked by MADV_FREE. There doesn't seem to be any indication of that from the GC logs. I guess this all comes down to me not having a good understanding of how the GC algorithm works and how to read these logs. I'd really appreciate it if anyone can explain what's happening and why.
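For reference, a minimal sketch of the kind of "[Memory]" watchdog that produces the INFO lines below and the triggered heap profiles might look like this (the /proc parsing, 100MB threshold, and file naming are assumptions for illustration, not the actual service code):

package memwatch

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"strconv"
	"strings"
	"time"
)

const mb = 1 << 20

// Watch logs runtime.MemStats once a second alongside RSS read from
// /proc/self/stat, and writes a heap profile whenever RSS crosses 100MB.
func Watch() {
	for range time.Tick(time.Second) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		rss := readRSS()
		log.Printf("[Memory]: Alloc=%dMB TotalAlloc=%dMB Sys=%dMB RSS=%dMB HeapSys=%dMB HeapIdle=%dMB HeapInUse=%dMB HeapReleased=%dMB",
			m.Alloc/mb, m.TotalAlloc/mb, m.Sys/mb, rss/mb,
			m.HeapSys/mb, m.HeapIdle/mb, m.HeapInuse/mb, m.HeapReleased/mb)
		if rss > 100*mb {
			f, err := os.Create(fmt.Sprintf("heap-%d.pprof", time.Now().Unix()))
			if err != nil {
				continue
			}
			pprof.Lookup("heap").WriteTo(f, 0)
			f.Close()
		}
	}
}

// readRSS returns the resident set size in bytes, from field 24 of /proc/self/stat.
func readRSS() uint64 {
	b, err := ioutil.ReadFile("/proc/self/stat")
	if err != nil {
		return 0
	}
	fields := strings.Fields(string(b))
	if len(fields) < 24 {
		return 0
	}
	pages, _ := strconv.ParseUint(fields[23], 10, 64)
	return pages * uint64(os.Getpagesize())
}

The gc lines themselves are what the runtime prints when the binary runs with GODEBUG=gctrace=1.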
> > > > > > > > > > > gc 41833 @19135.227s 0%: 0.019+2.3+0.005 ms clock, 0.079+0.29/2.2/5.6+0.020 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> > > > > > > > > > > INFO 2019-06-30T08:46:04.886 [Memory]: Alloc=7MB TotalAlloc=230172MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
> > > > > > > > > > > gc 41834 @19135.869s 0%: 0.005+2.9+0.003 ms clock, 0.023+0.32/2.5/6.6+0.012 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> > > > > > > > > > > INFO 2019-06-30T08:46:05.886 [Memory]: Alloc=9MB TotalAlloc=230179MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
> > > > > > > > > > > gc 41835 @19136.704s 0%: 0.038+2.1+0.004 ms clock, 0.15+0.35/2.1/5.3+0.016 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> > > > > > > > > > > INFO 2019-06-30T08:46:06.886 [Memory]: Alloc=9MB TotalAlloc=230184MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
> > > > > > > > > > > gc 41836 @19137.611s 0%: 0.009+2.1+0.003 ms clock, 0.036+0.39/2.0/5.7+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> > > > > > > > > > > INFO 2019-06-30T08:46:07.887 [Memory]: Alloc=10MB TotalAlloc=230190MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
> > > > > > > > > > > gc 41837 @19138.444s 0%: 0.008+2.1+0.004 ms clock, 0.035+0.51/2.1/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> > > > > > > > > > > INFO 2019-06-30T08:46:08.887 [Memory]: Alloc=10MB TotalAlloc=230195MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
> > > > > > > > > > > gc 41838 @19139.474s 0%: 0.005+2.6+0.003 ms clock, 0.023+0.37/2.5/4.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> > > > > > > > > > > gc 41839 @19140.173s 0%: 0.011+2.4+0.003 ms clock, 0.046+0.20/2.3/5.8+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> > > > > > > > > > > INFO 2019-06-30T08:46:09.887 [Memory]: Alloc=7MB TotalAlloc=230202MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
> > > > > > > > > > > gc 41840 @19140.831s 0%: 0.082+2.1+0.003 ms clock, 0.32+0.64/2.1/5.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> > > > > > > > > > > INFO 2019-06-30T08:46:10.887 [Memory]: Alloc=9MB TotalAlloc=230209MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
> > > > > > > > > > > gc 41841 @19141.655s 0%: 0.014+2.1+0.003 ms clock, 0.056+0.28/2.0/5.7+0.013 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> > > > > > > > > > > gc 41842 @19142.316s 0%: 0.006+2.7+0.003 ms clock, 0.027+0.29/2.6/6.2+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> > > > > > > > > > > INFO 2019-06-30T08:46:11.888 [Memory]: Alloc=6MB TotalAlloc=230216MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
> > > > > > > > > > > gc 41843 @19142.942s 0%: 0.010+2.1+0.005 ms clock, 0.040+0.29/2.0/5.7+0.023 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> > > > > > > > > > > INFO 2019-06-30T08:46:12.888 [Memory]: Alloc=9MB TotalAlloc=230223MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
> > > > > > > > > > > gc 41844 @19143.724s 0%: 0.008+2.4+0.004 ms clock, 0.035+0.38/2.0/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> > > > > > > > > > > gc 41845 @19144.380s 0%: 10+9.3+0.044 ms clock, 43+6.1/9.2/4.4+0.17 ms cpu, 11->11->6 MB, 12 MB goal, 4 P
> > > > > > > > > > > INFO 2019-06-30T08:46:13.901 [Memory]: Alloc=6MB TotalAlloc=230230MB Sys=136MB RSS=98MB HeapSys=94MB HeapIdle=83MB HeapInUse=11MB HeapReleased=35MB
> > > > > > > > > > > gc 41846 @19144.447s 0%: 0.008+26+0.005 ms clock, 0.033+0.46/7.8/26+0.020 ms cpu, 11->12->9 MB, 12 MB goal, 4 P
> > > > > > > > > > > gc 41847 @19144.672s 0%: 0.013+76+0.006 ms clock, 0.053+0.20/6.4/80+0.024 ms cpu, 17->18->8 MB, 18 MB goal, 4 P
> > > > > > > > > > > gc 41848 @19145.014s 0%: 0.008+172+0.005 ms clock, 0.035+0.13/8.5/177+0.022 ms cpu, 15->17->10 MB, 16 MB goal, 4 P
> > > > > > > > > > > gc 41849 @19145.298s 0%: 0.007+285+0.006 ms clock, 0.030+10/285/7.6+0.024 ms cpu, 19->23->15 MB, 20 MB goal, 4 P
> > > > > > > > > > > INFO 2019-06-30T08:46:15.052 [Memory]: Alloc=22MB TotalAlloc=230264MB Sys=598MB RSS=531MB HeapSys=265MB HeapIdle=240MB HeapInUse=25MB HeapReleased=164MB
> > > > > > > > > > > gc 41850 @19145.665s 0%: 10+419+0.005 ms clock, 43+489/158/0.60+0.021 ms cpu, 26->30->17 MB, 30 MB goal, 4 P
> > > > > > > > > > > gc 41851 @19146.325s 0%: 21+798+0.036 ms clock, 86+990/401/0+0.14 ms cpu, 28->42->30 MB, 34 MB goal, 4 P
> > > > > > > > > > > INFO 2019-06-30T08:46:16.613 [Memory]: Alloc=41MB TotalAlloc=230303MB Sys=995MB RSS=861MB HeapSys=199MB HeapIdle=155MB HeapInUse=44MB HeapReleased=54MB
> > > > > > > > > > >
> > > > > > > > > > > I also captured the OOM log from dmesg here https://gist.github.com/mightyguava/7ecc6fc55f5cd925062d6beede3783b3.
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Yunchi