On Thu, Aug 22, 2019 at 5:09 PM <mich...@steelcode.com> wrote:
>
> I have a fairly complex program that has about 150 goroutines servicing 
> requests. Some requests go out to cgo, which does some network IO. There is 
> one goroutine that is responsible for a heartbeat (etcd keepalive) and 
> sometimes (~every two weeks) this goroutine doesn't get scheduled for long 
> periods of time (more than ten seconds), so it misses the heartbeat.
>
> I've been debugging this on and off for several months, and am completely 
> sure it relates to scheduling because there is code that looks similar to 
> this this:
>
> fmt.Printf("before %v\n",time.Now())
> <-time.After(500*time.Millisecond)
> fmt.Printf("after %v\n",time.Now())
>
> And the print statements show times more than ten seconds apart.
>
> I considered it might be a very long GC stop the world (maybe some rogue 
> unpreemptible goroutine) but I don't see any output from gctrace=1 that 
> indicates that (I could be wrong)
>
> The goroutines in cgo can take some time to complete their operations 
> (normally tens of milliseconds, sometimes tens of seconds) but I have never 
> observed more than 200 processes being created in response to the blocking M 
> being moved off the P and a new M being created.
>
> The total CPU usage doesn't exceed 60%, so there are spare CPU cycles.
>
> I turned on scheduling trace, and here is the output from a problematic event:
> # this is where we select on <-time.After(500ms)
> [etcddebug] start of select 2019-08-01 06:29:55.873680674 +0000 UTC 
> m=+132838.232219823
> SCHED 132838304ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
> idlethreads=102 runqueue=2 [0 0 3 0 0 0 0 0]
> [INFO] 2019-08-01T06:29:56.084Z <some log output from another goroutine>
> SCHED 132839312ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
> idlethreads=105 runqueue=9 [0 0 0 0 5 0 0 0]
> <some output from another goroutine>
> SCHED 132840318ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
> idlethreads=106 runqueue=12 [0 0 0 0 5 0 0 0]
> SCHED 132841325ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
> idlethreads=106 runqueue=12 [0 0 0 0 5 0 0 0]
> SCHED 132842331ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
> idlethreads=107 runqueue=13 [0 0 0 0 5 0 0 0]
> SCHED 132843338ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
> idlethreads=107 runqueue=13 [0 0 0 0 5 0 0 0]
> SCHED 132844344ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
> idlethreads=107 runqueue=13 [0 0 0 0 5 0 0 0]
> SCHED 132845351ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
> idlethreads=107 runqueue=14 [0 0 0 0 5 0 0 0]
> SCHED 132846357ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
> idlethreads=107 runqueue=14 [0 0 0 0 5 0 0 0]
> SCHED 132847364ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
> idlethreads=107 runqueue=15 [0 0 0 0 5 0 0 0]
> # This is where the program gets killed because the <-time.After goroutine 
> never ran
> [ERROR] 2019-08-01T06:30:05.472Z clustering.go:259 ETCD LEASE CHANNEL CLOSED
>
>
> The program panics soon after that message (on purpose) and I get this 
> printout from sched trace. Not sure what triggers it.
>
> SCHED 132847848ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
> idlethreads=97 runqueue=2 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
> P0: status=2 schedtick=714575223 syscalltick=286704429 m=-1 runqsize=0 
> gfreecnt=41
> #this is the <-time.After goroutine finally running
> [etcddebug] time.After 2019-08-01 06:30:05.473046143 +0000 UTC 
> m=+132847.831585336 (227)
> P1: status=1 schedtick=714230235 syscalltick=286072819 m=36 runqsize=0 
> gfreecnt=11
> P2: status=2 schedtick=710948022 syscalltick=284960027 m=-1 runqsize=0 
> gfreecnt=53
> P3: status=1 schedtick=707003392 syscalltick=283311908 m=88 runqsize=1 
> gfreecnt=53
> P4: status=2 schedtick=698758214 syscalltick=280825500 m=-1 runqsize=0 
> gfreecnt=23
> P5: status=2 schedtick=690236090 syscalltick=278745170 m=-1 runqsize=0 
> gfreecnt=38
> P6: status=1 schedtick=685940734 syscalltick=276797704 m=77 runqsize=0 
> gfreecnt=51
> P7: status=2 schedtick=681779831 syscalltick=275239769 m=-1 runqsize=0 
> gfreecnt=50
>
>
>
> I've tried several times to create a simple reproducing program, and have 
> failed. I'm reaching out to ask if anyone could help me interpret the output 
> from the scheduling trace, or perhaps even suggest what might be causing 
> this, so I can create a standalone reproducing program.


One possibility is an unpremptible loop.  Can you get a stack trace at
the point where the timeout fails?  Perhaps where you print ERROR
above you could call runtime.Stack(buf, true) and print the output.
Look at the code that you see from that output and see if it any of it
might be in an unpreemtible loop.

Ian

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/CAOyqgcUk%3DgqMnPHLNSObmZdAvv2uzk9ZpmaeBpZh8kdGpcmwOA%40mail.gmail.com.

Reply via email to