Out of interest, what OS? Is the machine virtualised? You might want to try an OS syscall trace (e.g. strace on Linux) to see whether it's stuck in an OS call for some reason.
Regards,
Steve

On Sat, 24 Aug 2019 at 03:37, Michael Andersen <mich...@steelcode.com> wrote:

> This is built with Go 1.12.5, incidentally, but I have seen this on several Go versions spanning several months.
>
> On Fri, Aug 23, 2019 at 7:36 PM Michael Andersen <mich...@steelcode.com> wrote:
>
>> OK, so I have more information, and it's not what I would expect.
>>
>> I added scheddetail=1,schedtrace=2000 to GODEBUG so that I had a list of which M's and G's were on the P's during the 5 seconds that scheduling stalled. I added a sentinel goroutine that sleeps 1 second in a loop and panics if the sleep takes more than 4 seconds. As soon as the sentinel wakes up after the stall, it dumps the stacks of all the goroutines, on the assumption that at least some of them might still be in the same place as they were during the stall. That way I know for sure which goroutines caused the problem, and maybe which piece of the code they were in.
>>
>> During the stall, all of the P's had the same M's and the same G's, so nothing was changing except the idle threads and run queues (as I would expect, because my sentinel goroutine didn't get run). Here is a compilation of the information captured during the stall.
>>
>> SCHED 3425490ms: gomaxprocs=8 idleprocs=0 threads=52 spinningthreads=0 idlethreads=35 runqueue=8 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
>> < same M's on the same P's with the same curg's as below >
>>
>> SCHED 3427497ms: gomaxprocs=8 idleprocs=0 threads=52 spinningthreads=0 idlethreads=36 runqueue=9 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
>>   P0: status=1 schedtick=22978494 syscalltick=6811602 m=44 runqsize=0 gfreecnt=2
>>     M44: p=0 curg=12934173 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=false lockedg=-1
>>     G12934173: status=2(chan receive) m=44 lockedm=-1
>>
>>   P1: status=1 schedtick=22848756 syscalltick=6851102 m=34 runqsize=0 gfreecnt=32
>>     M34: p=1 curg=12934084 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=false lockedg=-1
>>     G12934084: status=2(chan receive) m=34 lockedm=-1
>>
>>   P2: status=1 schedtick=22743131 syscalltick=6730185 m=17 runqsize=0 gfreecnt=37
>>     M17: p=2 curg=12934032 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=false lockedg=-1
>>     G12934032: status=2(chan receive) m=17 lockedm=-1
>>
>>   P3: status=1 schedtick=22674653 syscalltick=6516472 m=11 runqsize=0 gfreecnt=20
>>     M11: p=3 curg=12934147 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=false lockedg=-1
>>     G12934147: status=2(chan receive) m=11 lockedm=-1
>>
>>   P4: status=1 schedtick=22693327 syscalltick=6322345 m=33 runqsize=0 gfreecnt=12
>>     M33: p=4 curg=12934059 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=false lockedg=-1
>>     G12934059: status=2(chan receive) m=33 lockedm=-1
>>
>>   P5: status=1 schedtick=22370152 syscalltick=6219903 m=0 runqsize=0 gfreecnt=39
>>     M0: p=5 curg=12934131 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=false lockedg=-1
>>     G12934131: status=2(chan receive) m=0 lockedm=-1
>>
>>   P6: status=1 schedtick=21910259 syscalltick=6171748 m=51 runqsize=0 gfreecnt=39
>>     M51: p=6 curg=12934126 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=false lockedg=-1
>>     G12934126: status=2(chan receive) m=51 lockedm=-1
>>
>>   P7: status=1 schedtick=22019793 syscalltick=6091894 m=32 runqsize=1 gfreecnt=30
>>     M32: p=7 curg=12934191 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=false lockedg=-1
>>     G12934191: status=2(chan receive) m=32 lockedm=-1
>>
>> After the sentinel goroutine was scheduled (~1 second after this schedtrace output) I captured the stacks and cross-referenced the G's. All of them are in time.Now():
>>
>> goroutine 12934173 [runnable]:
>> time.Now(0x7f453a441dc0, 0xc0169fe000, 0x5a0)
>>         /usr/local/go/src/time/time.go:1087 +0xb2
>> -- line in request handler in my app --
>>
>> goroutine 12934084 [runnable]:
>> time.Now(0x7f4614a65400, 0xc0155a2000, 0x1680)
>>         /usr/local/go/src/time/time.go:1087 +0xb2
>> -- line in request handler in my app --
>>
>> goroutine 12934032 [runnable]:
>> time.Now(0x7f457d74ae20, 0xc023179400, 0x3c0)
>>         /usr/local/go/src/time/time.go:1087 +0xb2
>> -- line in request handler in my app --
>>
>> goroutine 12934147 [runnable]:
>> time.Now(0x7f461fcca4c0, 0xc00cf38400, 0x3c0)
>>         /usr/local/go/src/time/time.go:1087 +0xb2
>> -- line in request handler in my app --
>>
>> goroutine 12934059 [runnable]:
>> time.Now(0x7f45ff9dcd00, 0xc0245b1000, 0xf00)
>>         /usr/local/go/src/time/time.go:1087 +0xb2
>> -- line in request handler in my app --
>>
>> goroutine 12934131 [runnable]:
>> time.Now(0x7f4513453e80, 0xc02a96c000, 0x5a0)
>>         /usr/local/go/src/time/time.go:1087 +0xb2
>> -- line in request handler in my app --
>>
>> goroutine 12934126 [runnable]:
>> time.Now(0x7f457fd46e60, 0xc016889200, 0x5a0)
>>         /usr/local/go/src/time/time.go:1087 +0xb2
>> -- line in request handler in my app --
>>
>> goroutine 12934191 [runnable]:
>> time.Now(0x7f45088d6b80, 0xc01770ec00, 0x5a0)
>>         /usr/local/go/src/time/time.go:1087 +0xb2
>> -- line in request handler in my app --
>>
>> I went through the path that a goroutine takes to get to that time.Now() line, and it flows from gRPC through to some simple sanitization. There are function calls sprinkled everywhere and no loops that would be unpreemptible, so I don't think they all got stuck in some tight loop before the time.Now() and then somehow all proceeded a few lines so that they ALL were on time.Now() when I dumped the stacks. I think they were stuck in time.Now() during the scheduling stall.
>>
>> This is running in Kubernetes on AWS, although I have seen it on a bare-metal Kubernetes cluster too.
>>
>> Any ideas where to progress from here?
>>
>> On Fri, Aug 23, 2019 at 3:32 PM Ian Lance Taylor <i...@golang.org> wrote:
>>
>>> On Fri, Aug 23, 2019 at 2:30 PM Michael Andersen <mich...@steelcode.com> wrote:
>>> >
>>> > Are you suggesting that there might be enough unpreemptable goroutines to fill all the P's? I do have several cgo goroutines sitting in syscalls, but my understanding was that the scheduler would "preempt" that by moving the M off the P and creating a new M.
>>>
>>> That is how it is supposed to work, yes. That process can take up to 10ms, but that doesn't explain what you are seeing.
>>>
>>> Ian

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/CAHEMsqYBbeMH0%3DOqS%3D34vnLmOkTsJsWV1VE6Bw8jUwudw54Tew%40mail.gmail.com.