This is built with go 1.12.5, incidentally, but I have seen this on several go versions spanning several months.
On Fri, Aug 23, 2019 at 7:36 PM Michael Andersen <mich...@steelcode.com> wrote: > Ok, so I have more information, and it's not what I would expect. > > I added scheddetail=1,schedtrace=2000 so that I had a list of which M's > and G's were on the P's during the 5 seconds that scheduling stalled. I > added a sentinel goroutine that sleeps 1 second in a loop and panics if the > sleep takes more than 4 seconds. As soon as the sentinel wakes up after the > stall, it dumps all the stacks for all the goroutines under the assumption > that at least some of them might be in the same place as they were during > the stall. Thus, I know for sure which goroutines caused the problem and > maybe which piece of the code they were in. > > During the stall, all of the P's had the same M's and the same G's, so > nothing was changing except the idlethreads and runqueues (as I would > expect because my sentinel goroutine didn't get run). Here is a compilation > of the information captured during the stall. > > SCHED 3425490ms: gomaxprocs=8 idleprocs=0 threads=52 spinningthreads=0 > idlethreads=35 runqueue=8 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0 > < same M's on the same P's with the same curg's as below > > SCHED 3427497ms: gomaxprocs=8 idleprocs=0 threads=52 spinningthreads=0 > idlethreads=36 runqueue=9 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0 > P0: status=1 schedtick=22978494 syscalltick=6811602 m=44 runqsize=0 > gfreecnt=2 > M44: p=0 curg=12934173 mallocing=0 throwing=0 preemptoff= locks=0 > dying=0 spinning=false blocked=false lockedg=-1 > G12934173: status=2(chan receive) m=44 lockedm=-1 > > P1: status=1 schedtick=22848756 syscalltick=6851102 m=34 runqsize=0 > gfreecnt=32 > M34: p=1 curg=12934084 mallocing=0 throwing=0 preemptoff= locks=0 > dying=0 spinning=false blocked=false lockedg=-1 > G12934084: status=2(chan receive) m=34 lockedm=-1 > > P2: status=1 schedtick=22743131 syscalltick=6730185 m=17 runqsize=0 > gfreecnt=37 > M17: p=2 curg=12934032 mallocing=0 throwing=0 preemptoff= locks=0 > dying=0 spinning=false blocked=false lockedg=-1 > G12934032: status=2(chan receive) m=17 lockedm=-1 > > P3: status=1 schedtick=22674653 syscalltick=6516472 m=11 runqsize=0 > gfreecnt=20 > M11: p=3 curg=12934147 mallocing=0 throwing=0 preemptoff= locks=0 > dying=0 spinning=false blocked=false lockedg=-1 > G12934147: status=2(chan receive) m=11 lockedm=-1 > > P4: status=1 schedtick=22693327 syscalltick=6322345 m=33 runqsize=0 > gfreecnt=12 > M33: p=4 curg=12934059 mallocing=0 throwing=0 preemptoff= locks=0 > dying=0 spinning=false blocked=false lockedg=-1 > G12934059: status=2(chan receive) m=33 lockedm=-1 > > P5: status=1 schedtick=22370152 syscalltick=6219903 m=0 runqsize=0 > gfreecnt=39 > M0: p=5 curg=12934131 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 > spinning=false blocked=false lockedg=-1 > G12934131: status=2(chan receive) m=0 lockedm=-1 > > P6: status=1 schedtick=21910259 syscalltick=6171748 m=51 runqsize=0 > gfreecnt=39 > M51: p=6 curg=12934126 mallocing=0 throwing=0 preemptoff= locks=0 > dying=0 spinning=false blocked=false lockedg=-1 > G12934126: status=2(chan receive) m=51 lockedm=-1 > > P7: status=1 schedtick=22019793 syscalltick=6091894 m=32 runqsize=1 > gfreecnt=30 > M32: p=7 curg=12934191 mallocing=0 throwing=0 preemptoff= locks=0 > dying=0 spinning=false blocked=false lockedg=-1 > G12934191: status=2(chan receive) m=32 lockedm=-1 > > After the sentinel goroutine was scheduled (~1 second after this > schedtrace output) I captured the stacks and crossreferenced the G's. All > of them are in time.Now() > > goroutine 12934173 [runnable]: > time.Now(0x7f453a441dc0, 0xc0169fe000, 0x5a0) > /usr/local/go/src/time/time.go:1087 +0xb2 > -- line in request handler in my app-- > > goroutine 12934084 [runnable]: > time.Now(0x7f4614a65400, 0xc0155a2000, 0x1680) > /usr/local/go/src/time/time.go:1087 +0xb2 > -- line in request handler in my app-- > > goroutine 12934032 [runnable]: > time.Now(0x7f457d74ae20, 0xc023179400, 0x3c0) > /usr/local/go/src/time/time.go:1087 +0xb2 > -- line in request handler in my app-- > > goroutine 12934147 [runnable]: > time.Now(0x7f461fcca4c0, 0xc00cf38400, 0x3c0) > /usr/local/go/src/time/time.go:1087 +0xb2 > -- line in request handler in my app-- > > goroutine 12934059 [runnable]: > time.Now(0x7f45ff9dcd00, 0xc0245b1000, 0xf00) > /usr/local/go/src/time/time.go:1087 +0xb2 > -- line in request handler in my app-- > > goroutine 12934131 [runnable]: > time.Now(0x7f4513453e80, 0xc02a96c000, 0x5a0) > /usr/local/go/src/time/time.go:1087 +0xb2 > -- line in request handler in my app-- > > goroutine 12934126 [runnable]: > time.Now(0x7f457fd46e60, 0xc016889200, 0x5a0) > /usr/local/go/src/time/time.go:1087 +0xb2 > -- line in request handler in my app-- > > goroutine 12934191 [runnable]: > time.Now(0x7f45088d6b80, 0xc01770ec00, 0x5a0) > /usr/local/go/src/time/time.go:1087 +0xb2 > -- line in request handler in my app-- > > I went through the path that a goroutine takes to get to that time.Now() > line and it flows from GRPC through to some simple sanitization. There are > function calls sprinkled everywhere and no loops that would be > unpreemtible, so I don't think they all got stuck in some tight loop before > the time.Now and then somehow all proceeded a few lines so that they ALL > were on time.Now when I dumped the stacks. I think they were stuck in > time.Now() during the scheduling stall. > > This is running in kubernetes on AWS, although I have seen it on a bare > metal kubernetes cluster too. > > Any ideas where to progress from here? > > > > On Fri, Aug 23, 2019 at 3:32 PM Ian Lance Taylor <i...@golang.org> wrote: > >> On Fri, Aug 23, 2019 at 2:30 PM Michael Andersen <mich...@steelcode.com> >> wrote: >> > >> > Are you suggesting that there might be enough unpreemtable goroutines >> to fill all the P's? I do have several cgo goroutines sitting in syscalls, >> but my understanding was that the scheduler would "preempt" that by moving >> the M off the P and creating a new M. >> >> That is how it is supposed to work, yes. That process can take up to >> 10ms, but that doesn't explain what you are seeing. >> >> Ian >> > -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/CAPLpPruoeZfEHBHrZJQPM1Ynd3u0gZXPbwBfMKb%2BnQb8RwKU-A%40mail.gmail.com.