Thanks for the various possibilities. It’s helpful to know things to look out for.
I don’t think there’s a tight polling loop in the code; at least I haven’t found one yet. We’re using some open source packages, notably Kanister (https://github.com/kanisterio), which Kasten contributes, and Stow (https://github.com/graymeta/stow), and in this case some Google cloud APIs through Stow.

On more than one occasion it has looked like the application goroutine gets stuck allocating an 8MB byte slice. (8MB is the default chunk size in one of the Google cloud storage APIs, and Stow lets it default.) I don’t know whether an 8MB malloc is a problem per se, or whether a few of those trigger GC and take the shouldhelpgc path.

It’s most likely that the problem is in our code somewhere, or in settings that match our usage pattern of growing and shrinking the live heap. I have wondered whether we might be hitting a corner case in the runtime (missed wakeup?). Given the wide usage of Go across a variety of different types of programs, I doubt it, but something sure looks funny with long StopTheWorld times.

Any additional thoughts on diagnosis or experiments? If I can’t figure out exactly what’s happening I’ll see about a workaround such as explicit calls to runtime.GC() at safe spots or periodically, etc.

Eric

p.s. Here’s the top of a stack trace triggered with a panic in the SIGTERM handler. It shows the make([]byte, 0, 8 * 1024 * 1024) triggering gcStart():

goroutine 15 [runnable]:
runtime.gcStart(0x0, 0x1, 0x0, 0x0)
	/usr/local/go/src/runtime/mgc.go:1236 +0x583 fp=0xc000cd7540 sp=0xc000cd7538 pc=0x419e63
runtime.mallocgc(0x800000, 0x1c04f80, 0x1, 0x600)
	/usr/local/go/src/runtime/malloc.go:996 +0x40e fp=0xc000cd75e0 sp=0xc000cd7540 pc=0x40bb0e
runtime.makeslice(0x1c04f80, 0x0, 0x800000, 0xc00022ad50, 0xc000cd7678, 0x15cfcdc)
	/usr/local/go/src/runtime/slice.go:70 +0x77 fp=0xc000cd7610 sp=0xc000cd75e0 pc=0x4414f7
vendor/google.golang.org/api/gensupport.NewMediaBuffer(...)
	/repo/go/src/vendor/google.golang.org/api/gensupport/buffer.go:26
vendor/google.golang.org/api/gensupport.PrepareUpload(0x23aa900, 0xc000039b30, 0x800000, 0x0, 0x23aa900, 0xc000039b30)
	/repo/go/src/vendor/google.golang.org/api/gensupport/media.go:188 +0x50 fp=0xc000cd7688 sp=0xc000cd7610 pc=0x15d03e0
vendor/google.golang.org/api/storage/v1.(*ObjectsInsertCall).Media(0xc000b2e780, 0x23a5a80, 0xc00022ad50, 0x0, 0x0, 0x0, 0xe)
	/repo/go/src/vendor/google.golang.org/api/storage/v1/storage-gen.go:8622 +0xb8 fp=0xc000cd7718 sp=0xc000cd7688 pc=0x195c188
vendor/github.com/graymeta/stow/google.(*Container).Put(0xc000de4be0, 0xc00058a871, 0x86, 0x23a5a80, 0xc00022ad50, 0x45, 0xc00022ad80, 0x50, 0xc000cd7840, 0x40c278, ...)
	/repo/go/src/vendor/github.com/graymeta/stow/google/container.go:144 +0x186 fp=0xc000cd77d0 sp=0xc000cd7718 pc=0x1969d36
github.com/kanisterio/kanister/pkg/objectstore.(*directory).Put(0xc000de5b20, 0x23c9740, 0xc00090b200, 0x20fdcf9, 0xe, 0x23a5a80, 0xc00022ad50, 0x45, 0x0, 0xc000171ad0, ...)
	/repo/go/src/github.com/kanisterio/kanister/pkg/objectstore/directory.go:206 +0xe9 fp=0xc000cd7850 sp=0xc000cd77d0 pc=0x1973769
github.com/kanisterio/kanister/pkg/objectstore.(*directory).PutBytes(0xc000de5b20, 0x23c9740, 0xc00090b200, 0x20fdcf9, 0xe, 0xc000e4e410, 0x45, 0x50, 0x0, 0x9d895a, ...)
	/repo/go/src/github.com/kanisterio/kanister/pkg/objectstore/directory.go:212 +0xd5 fp=0xc000cd78b8 sp=0xc000cd7850 pc=0x19739a5
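p.p.s. For concreteness, here's a minimal sketch of the periodic runtime.GC() workaround I mentioned above. The startPeriodicGC name and the 2-minute interval are just placeholders I've made up for illustration; nothing like this is in the product today:

package main

import (
	"runtime"
	"time"
)

// startPeriodicGC forces a collection on a fixed interval so the live heap
// never sits idle for long between GCs. The interval is an arbitrary
// placeholder, not a tuned value.
func startPeriodicGC(stop <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(2 * time.Minute)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				runtime.GC() // blocks until the forced collection finishes
			case <-stop:
				return
			}
		}
	}()
}

func main() {
	stop := make(chan struct{})
	startPeriodicGC(stop)
	time.Sleep(5 * time.Minute) // placeholder for the real application work
	close(stop)
}

Separately, if the 8MB buffer itself turns out to matter: if I'm reading the API right, the google.golang.org/api storage client accepts a googleapi.ChunkSize media option, but since Stow's Put calls Media() without options we'd have to change Stow (or call the storage API directly) to set it.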
> On Dec 11, 2018, at 8:21 PM, robert engels <reng...@ix.netcom.com> wrote:
>
> You might want to review this https://github.com/golang/go/issues/10958
>
>> On Dec 11, 2018, at 10:16 PM, robert engels <reng...@ix.netcom.com> wrote:
>>
>> Reviewing the code, the 5s of cpu time is technically the stop-the-world (STW) sweep termination.
>>
>> So, I think the cause of your problem is that you have a tight / “infinite” loop that is not calling runtime.Gosched(), so it is taking a very long time for the sweep termination to complete.
>>
>>> On Dec 11, 2018, at 9:43 PM, robert engels <reng...@ix.netcom.com> wrote:
>>>
>>> Well, your pause is clearly related to the GC - the first phase, the mark, is 5s in #17. Are you certain you don’t have an incorrect highly recursive loop that is causing the stack marking to take a really long time… ?
>>>
>>>> On Dec 11, 2018, at 8:45 PM, Eric Hamilton <e...@kasten.io> wrote:
>>>>
>>>> Of course. (I forgot about that option and I'd collected those traces at a time when I thought I'd ruled out GC -- probably misread MemStats values.)
>>>>
>>>> Here's the gctrace output for a repro with a 5.6 second delay ending with finish of gc17 and a 30+ second delay ending in SIGTERM (which coincides with completion of gc18):
>>>>
>>>> gc 1 @0.022s 1%: 0.94+10+0.19 ms clock, 0.94+0.65/0.60/5.3+0.19 ms cpu, 4->4->1 MB, 5 MB goal, 4 P
>>>> gc 2 @0.048s 3%: 0.007+5.9+0.17 ms clock, 0.007+0.24/5.7/1.0+0.17 ms cpu, 4->4->1 MB, 5 MB goal, 4 P
>>>> gc 3 @0.062s 3%: 0.004+3.7+0.18 ms clock, 0.004+0.19/0.78/2.8+0.18 ms cpu, 4->4->1 MB, 5 MB goal, 4 P
>>>> gc 4 @0.075s 13%: 0.003+3.4+69 ms clock, 0.003+0.069/0.99/2.3+69 ms cpu, 4->4->2 MB, 5 MB goal, 4 P
>>>> gc 5 @0.154s 12%: 0.007+7.4+0.032 ms clock, 0.007+0.15/0/5.1+0.032 ms cpu, 4->5->3 MB, 5 MB goal, 4 P
>>>> gc 6 @0.234s 10%: 0.004+34+0.025 ms clock, 0.004+0/33/6.8+0.025 ms cpu, 6->7->5 MB, 7 MB goal, 4 P
>>>> gc 7 @0.296s 8%: 0.005+28+0.031 ms clock, 0.005+0/0.99/50+0.031 ms cpu, 10->10->5 MB, 11 MB goal, 4 P
>>>> gc 8 @0.416s 6%: 0.007+49+1.1 ms clock, 0.007+0.13/4.5/32+1.1 ms cpu, 11->11->6 MB, 12 MB goal, 4 P
>>>> gc 9 @0.591s 5%: 0.006+23+0.032 ms clock, 0.006+0/6.5/31+0.032 ms cpu, 12->12->7 MB, 13 MB goal, 4 P
>>>> gc 10 @0.640s 5%: 0.006+10+0.032 ms clock, 0.006+0/3.6/12+0.032 ms cpu, 13->13->8 MB, 14 MB goal, 4 P
>>>> gc 11 @120.657s 0%: 0.012+11+0.030 ms clock, 0.012+0/1.7/23+0.030 ms cpu, 13->13->9 MB, 16 MB goal, 4 P
>>>> gc 12 @240.674s 0%: 0.011+12+0.037 ms clock, 0.011+0/6.1/27+0.037 ms cpu, 10->10->9 MB, 18 MB goal, 4 P
>>>> gc 13 @360.691s 0%: 0.011+10+0.022 ms clock, 0.011+0/2.7/25+0.022 ms cpu, 10->10->9 MB, 18 MB goal, 4 P
>>>> gc 14 @480.711s 0%: 0.018+11+0.021 ms clock, 0.018+0/4.1/27+0.021 ms cpu, 10->10->9 MB, 18 MB goal, 4 P
>>>> gc 15 @600.727s 0%: 0.016+14+0.030 ms clock, 0.016+0/7.0/30+0.030 ms cpu, 11->11->9 MB, 18 MB goal, 4 P
>>>> gc 16 @720.746s 0%: 0.016+13+0.023 ms clock, 0.016+0/5.7/27+0.023 ms cpu, 10->10->9 MB, 18 MB goal, 4 P
>>>> gc 17 @816.508s 0%: 5560+26+0.067 ms clock, 5560+7.3/26/22+0.067 ms cpu, 18->19->11 MB, 19 MB goal, 4 P
>>>> gc 18 @824.133s 1%: 40566+16+0.035 ms clock, 40566+6.8/15/10+0.035 ms cpu, 23->24->18 MB, 24 MB goal, 4 P
>>>>
>>>> I've attached a file that also includes the schedtrace=1000, health check log messages, key app logs, and the runtime.MemStats collected at the end of the same run.
>>>>
>>>> Thanks again for any insights,
>>>> Eric
>>>>
>>>> On Tuesday, December 11, 2018 at 4:32:33 PM UTC-8, robert engels wrote:
>>>> I think it would be more helpful if you used gctrace=1 to report on the GC activity.
>>>>
>>>> <trace.medium.txt>