Thanks for the various possibilities.  It’s helpful to know things to look out 
for.

I don’t think there’s a tight polling loop in the code— at least I haven’t 
found it yet.

We’re using some open source packages, notably Kanister 
(https://github.com/kanisterio), to which Kasten contributes, and Stow 
(https://github.com/graymeta/stow); in this case Stow in turn calls some 
Google cloud APIs.

On more than one occasion it has looked like the application goroutine gets 
stuck allocating an 8MB byte slice.  (8MB is the default chunk size in one of 
the Google cloud storage APIs and Stow lets it default.)

I don’t know whether an 8MB malloc is a problem per se, or whether a few of 
them in quick succession are enough to trigger GC via the shouldhelpgc path in 
mallocgc.

It’s most likely that the problem is in our code somewhere, or in settings 
that interact badly with our usage pattern of growing and shrinking the live 
heap.  I have wondered whether we might be hitting a corner case in the 
runtime (missed wakeup?).  Given the wide usage of Go across a variety of 
different types of programs, I doubt it, but something sure looks funny with 
the long stop-the-world times.

Any additional thoughts on diagnosis or experiments?  If I can’t figure out 
exactly what’s happening I’ll look into a workaround such as explicit calls to 
runtime.GC() at safe spots or periodically, etc.


Eric


p.s. Here’s the top of a stack trace triggered with a panic in a SIGTERM 
handler.  It shows the make([]byte, 0, 8*1024*1024) triggering gcStart().


goroutine 15 [runnable]:
runtime.gcStart(0x0, 0x1, 0x0, 0x0)
        /usr/local/go/src/runtime/mgc.go:1236 +0x583 fp=0xc000cd7540 sp=0xc000cd7538 pc=0x419e63
runtime.mallocgc(0x800000, 0x1c04f80, 0x1, 0x600)
        /usr/local/go/src/runtime/malloc.go:996 +0x40e fp=0xc000cd75e0 sp=0xc000cd7540 pc=0x40bb0e
runtime.makeslice(0x1c04f80, 0x0, 0x800000, 0xc00022ad50, 0xc000cd7678, 0x15cfcdc)
        /usr/local/go/src/runtime/slice.go:70 +0x77 fp=0xc000cd7610 sp=0xc000cd75e0 pc=0x4414f7
vendor/google.golang.org/api/gensupport.NewMediaBuffer(...)
        /repo/go/src/vendor/google.golang.org/api/gensupport/buffer.go:26
vendor/google.golang.org/api/gensupport.PrepareUpload(0x23aa900, 0xc000039b30, 0x800000, 0x0, 0x23aa900, 0xc000039b30)
        /repo/go/src/vendor/google.golang.org/api/gensupport/media.go:188 +0x50 fp=0xc000cd7688 sp=0xc000cd7610 pc=0x15d03e0
vendor/google.golang.org/api/storage/v1.(*ObjectsInsertCall).Media(0xc000b2e780, 0x23a5a80, 0xc00022ad50, 0x0, 0x0, 0x0, 0xe)
        /repo/go/src/vendor/google.golang.org/api/storage/v1/storage-gen.go:8622 +0xb8 fp=0xc000cd7718 sp=0xc000cd7688 pc=0x195c188
vendor/github.com/graymeta/stow/google.(*Container).Put(0xc000de4be0, 0xc00058a871, 0x86, 0x23a5a80, 0xc00022ad50, 0x45, 0xc00022ad80, 0x50, 0xc000cd7840, 0x40c278, ...)
        /repo/go/src/vendor/github.com/graymeta/stow/google/container.go:144 +0x186 fp=0xc000cd77d0 sp=0xc000cd7718 pc=0x1969d36
github.com/kanisterio/kanister/pkg/objectstore.(*directory).Put(0xc000de5b20, 0x23c9740, 0xc00090b200, 0x20fdcf9, 0xe, 0x23a5a80, 0xc00022ad50, 0x45, 0x0, 0xc000171ad0, ...)
        /repo/go/src/github.com/kanisterio/kanister/pkg/objectstore/directory.go:206 +0xe9 fp=0xc000cd7850 sp=0xc000cd77d0 pc=0x1973769
github.com/kanisterio/kanister/pkg/objectstore.(*directory).PutBytes(0xc000de5b20, 0x23c9740, 0xc00090b200, 0x20fdcf9, 0xe, 0xc000e4e410, 0x45, 0x50, 0x0, 0x9d895a, ...)
        /repo/go/src/github.com/kanisterio/kanister/pkg/objectstore/directory.go:212 +0xd5 fp=0xc000cd78b8 sp=0xc000cd7850 pc=0x19739a5




> On Dec 11, 2018, at 8:21 PM, robert engels <reng...@ix.netcom.com> wrote:
> 
> You might want to review this https://github.com/golang/go/issues/10958 
> <https://github.com/golang/go/issues/10958>
> 
>> On Dec 11, 2018, at 10:16 PM, robert engels <reng...@ix.netcom.com 
>> <mailto:reng...@ix.netcom.com>> wrote:
>> 
>> Reviewing the code, the 5s of cpu time is technically the stop-the-world 
>> (STW) sweep termination
>> 
>> So, I think the cause of your problem is that you have a tight / “infinite” 
>> loop that is not calling runtime.Gosched(), so it is taking a very long time 
>> for the sweep termination to complete.
>> 
>> 
>>> On Dec 11, 2018, at 9:43 PM, robert engels <reng...@ix.netcom.com 
>>> <mailto:reng...@ix.netcom.com>> wrote:
>>> 
>>> Well, your pause is clearly related to the GC - the first phase, the mark, 
>>> is 5s in #17. Are you certain you don’t have an incorrect highly recursive 
>>> loop that is causing the stack marking to take a really long time… ?
>>> 
>>> 
>>>> On Dec 11, 2018, at 8:45 PM, Eric Hamilton <e...@kasten.io 
>>>> <mailto:e...@kasten.io>> wrote:
>>>> 
>>>> Of course.  (I forgot about that option and I'd collected those traces at 
>>>> a time when I thought I'd ruled out GC-- probably misread MemStats values).
>>>> 
>>>> Here's the gctrace output for a repro with a 5.6 second delay ending with 
>>>> finish of gc17 and a 30+ second delay ending in SIGTERM (which coincides 
>>>> with completion of gc18):
>>>> 
>>>> gc 1 @0.022s 1%: 0.94+10+0.19 ms clock, 0.94+0.65/0.60/5.3+0.19 ms cpu, 4->4->1 MB, 5 MB goal, 4 P
>>>> gc 2 @0.048s 3%: 0.007+5.9+0.17 ms clock, 0.007+0.24/5.7/1.0+0.17 ms cpu, 4->4->1 MB, 5 MB goal, 4 P
>>>> gc 3 @0.062s 3%: 0.004+3.7+0.18 ms clock, 0.004+0.19/0.78/2.8+0.18 ms cpu, 4->4->1 MB, 5 MB goal, 4 P
>>>> gc 4 @0.075s 13%: 0.003+3.4+69 ms clock, 0.003+0.069/0.99/2.3+69 ms cpu, 4->4->2 MB, 5 MB goal, 4 P
>>>> gc 5 @0.154s 12%: 0.007+7.4+0.032 ms clock, 0.007+0.15/0/5.1+0.032 ms cpu, 4->5->3 MB, 5 MB goal, 4 P
>>>> gc 6 @0.234s 10%: 0.004+34+0.025 ms clock, 0.004+0/33/6.8+0.025 ms cpu, 6->7->5 MB, 7 MB goal, 4 P
>>>> gc 7 @0.296s 8%: 0.005+28+0.031 ms clock, 0.005+0/0.99/50+0.031 ms cpu, 10->10->5 MB, 11 MB goal, 4 P
>>>> gc 8 @0.416s 6%: 0.007+49+1.1 ms clock, 0.007+0.13/4.5/32+1.1 ms cpu, 11->11->6 MB, 12 MB goal, 4 P
>>>> gc 9 @0.591s 5%: 0.006+23+0.032 ms clock, 0.006+0/6.5/31+0.032 ms cpu, 12->12->7 MB, 13 MB goal, 4 P
>>>> gc 10 @0.640s 5%: 0.006+10+0.032 ms clock, 0.006+0/3.6/12+0.032 ms cpu, 13->13->8 MB, 14 MB goal, 4 P
>>>> gc 11 @120.657s 0%: 0.012+11+0.030 ms clock, 0.012+0/1.7/23+0.030 ms cpu, 13->13->9 MB, 16 MB goal, 4 P
>>>> gc 12 @240.674s 0%: 0.011+12+0.037 ms clock, 0.011+0/6.1/27+0.037 ms cpu, 10->10->9 MB, 18 MB goal, 4 P
>>>> gc 13 @360.691s 0%: 0.011+10+0.022 ms clock, 0.011+0/2.7/25+0.022 ms cpu, 10->10->9 MB, 18 MB goal, 4 P
>>>> gc 14 @480.711s 0%: 0.018+11+0.021 ms clock, 0.018+0/4.1/27+0.021 ms cpu, 10->10->9 MB, 18 MB goal, 4 P
>>>> gc 15 @600.727s 0%: 0.016+14+0.030 ms clock, 0.016+0/7.0/30+0.030 ms cpu, 11->11->9 MB, 18 MB goal, 4 P
>>>> gc 16 @720.746s 0%: 0.016+13+0.023 ms clock, 0.016+0/5.7/27+0.023 ms cpu, 10->10->9 MB, 18 MB goal, 4 P
>>>> gc 17 @816.508s 0%: 5560+26+0.067 ms clock, 5560+7.3/26/22+0.067 ms cpu, 18->19->11 MB, 19 MB goal, 4 P
>>>> gc 18 @824.133s 1%: 40566+16+0.035 ms clock, 40566+6.8/15/10+0.035 ms cpu, 23->24->18 MB, 24 MB goal, 4 P
>>>> 
>>>> I've attached a file that also includes the schedtrace=1000, health check 
>>>> log messages, key app logs, and the runtime.MemStats collected at the end 
>>>> of the same run.
>>>> 
>>>> Thanks again for any insights,
>>>> Eric
>>>> 
>>>> 
>>>> On Tuesday, December 11, 2018 at 4:32:33 PM UTC-8, robert engels wrote:
>>>> I think it would be more helpful if you used gctrace=1 to report on the GC 
>>>> activity.
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> You received this message because you are subscribed to the Google Groups 
>>>> "golang-nuts" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send an 
>>>> email to golang-nuts+unsubscr...@googlegroups.com 
>>>> <mailto:golang-nuts+unsubscr...@googlegroups.com>.
>>>> For more options, visit https://groups.google.com/d/optout 
>>>> <https://groups.google.com/d/optout>.
>>>> <trace.medium.txt>
>>> 
>>> 
>>> -- 
>>> You received this message because you are subscribed to the Google Groups 
>>> "golang-nuts" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an 
>>> email to golang-nuts+unsubscr...@googlegroups.com 
>>> <mailto:golang-nuts+unsubscr...@googlegroups.com>.
>>> For more options, visit https://groups.google.com/d/optout 
>>> <https://groups.google.com/d/optout>.
>> 
>> 
> 
