Re: [go-nuts] Pause failures, GC, & StopTheWorld

2018-12-14 Thread robert engels
Usually if it is hung doing an allocation, that means that the STW cannot complete, which means some goroutine is in a tight loop, not yielding to the scheduler. The only other possibility (I would think) is if you placed a cap on the memory size of the process, and it is trying to allocate but

Re: [go-nuts] Pause failures, GC, & StopTheWorld

2018-12-12 Thread Eric Hamilton
We do have a convenient sampler that tells us the state of all of the goroutines, goprocs, and threads: schedtrace+scheddetail. The traces in my first post were one example, but I think I can find/repro a simpler case where we are stuck harder with nothing changing until the SIGTERM comes in.

Re: [go-nuts] Pause failures, GC, & StopTheWorld

2018-12-12 Thread robert engels
I think the allocation triggers the GC, but it is stuck at the start because it is waiting for all other goroutines to reach a safe point. I would grab a trace (but I’m not sure you can capture the trace if things are “stuck”) and review that. Or maybe use an external sampler like ‘Instruments’

Re: [go-nuts] Pause failures, GC, & StopTheWorld

2018-12-12 Thread Eric Hamilton
Thanks for the various possibilities. It’s helpful to know things to look out for. I don’t think there’s a tight polling loop in the code, at least I haven’t found it yet. We’re using some open source packages, notably Kanister (https://github.com/kanisterio) which Kasten contributes, Stow (

Re: [go-nuts] Pause failures, GC, & StopTheWorld

2018-12-11 Thread robert engels
You might want to review this: https://github.com/golang/go/issues/10958
> On Dec 11, 2018, at 10:16 PM, robert engels wrote:
>
> Reviewing the code, the 5s of CPU time is technically the stop-the-world (STW) sweep termination.
>
> So, I think the cause of your problem is that you have a tight

Re: [go-nuts] Pause failures, GC, & StopTheWorld

2018-12-11 Thread robert engels
Reviewing the code, the 5s of CPU time is technically the stop-the-world (STW) sweep termination.
So, I think the cause of your problem is that you have a tight / “infinite” loop that is not calling runtime.Gosched(), so it is taking a very long time for the sweep termination to complete.
> On

Re: [go-nuts] Pause failures, GC, & StopTheWorld

2018-12-11 Thread robert engels
Btw, I am only guessing large stacks because the memory in use is not that great or expanding, and the CPU time is massive - meaning walking lots of stack or root objects.
> On Dec 11, 2018, at 9:43 PM, robert engels wrote:
>
> Well, your pause is clearly related to the GC - the first phase, t

Re: [go-nuts] Pause failures, GC, & StopTheWorld

2018-12-11 Thread robert engels
Well, your pause is clearly related to the GC - the first phase, the mark, is 5s in #17. Are you certain you don’t have an incorrect highly recursive loop that is causing the stack marking to take a really long time… ?
> On Dec 11, 2018, at 8:45 PM, Eric Hamilton wrote:
>
> Of course. (I for

Re: [go-nuts] Pause failures, GC, & StopTheWorld

2018-12-11 Thread Eric Hamilton
Of course. (I forgot about that option and I'd collected those traces at a time when I thought I'd ruled out GC-- probably misread MemStats values). Here's the gctrace output for a repro with a 5.6 second delay ending with finish of gc17 and a 30+ second delay ending in SIGTERM (which coincides

Re: [go-nuts] Pause failures, GC, & StopTheWorld

2018-12-11 Thread robert engels
I think it would be more helpful if you used gctrace=1 to report on the GC activity.
> On Dec 11, 2018, at 3:37 PM, e...@kasten.io wrote:
>
> I am observing pause failures in an application and it appears to me that I’m seeing it take from 5 to 28 seconds or more for the StopTheWorld to take

[go-nuts] Pause failures, GC, & StopTheWorld

2018-12-11 Thread eric
I am observing pause failures in an application and it appears to me that I’m seeing it take from 5 to 28 seconds or more for the StopTheWorld to take effect for GC. In all likelihood the problem lies in the application, but before I start changing it or attempting to tune GC, I’d like to under