This is built with go 1.12.5, incidentally, but I have seen this on several
go versions spanning several months.

On Fri, Aug 23, 2019 at 7:36 PM Michael Andersen <mich...@steelcode.com>
wrote:

> Ok, so I have more information, and it's not what I would expect.
>
> I added scheddetail=1,schedtrace=2000 so that I had a list of which M's
> and G's were on the P's during the 5 seconds that scheduling stalled. I
> added a sentinel goroutine that sleeps 1 second in a loop and panics if the
> sleep takes more than 4 seconds. As soon as the sentinel wakes up after the
> stall, it dumps all the stacks for all the goroutines under  the assumption
> that at least some of them might be in the same place as they were during
> the stall. Thus, I know for sure which goroutines caused the problem and
> maybe which piece of the code they were in.
>
> During the stall, all of the P's had the same M's and the same G's, so
> nothing was changing except the idlethreads and runqueues (as I would
> expect because my sentinel goroutine didn't get run). Here is a compilation
> of the information captured during the stall.
>
> SCHED 3425490ms: gomaxprocs=8 idleprocs=0 threads=52 spinningthreads=0
> idlethreads=35 runqueue=8 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
>  < same M's on the same P's with the same curg's as below >
> SCHED 3427497ms: gomaxprocs=8 idleprocs=0 threads=52 spinningthreads=0
> idlethreads=36 runqueue=9 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
>   P0: status=1 schedtick=22978494 syscalltick=6811602 m=44 runqsize=0
> gfreecnt=2
>   M44: p=0 curg=12934173 mallocing=0 throwing=0 preemptoff= locks=0
> dying=0 spinning=false blocked=false lockedg=-1
>   G12934173: status=2(chan receive) m=44 lockedm=-1
>
>   P1: status=1 schedtick=22848756 syscalltick=6851102 m=34 runqsize=0
> gfreecnt=32
>   M34: p=1 curg=12934084 mallocing=0 throwing=0 preemptoff= locks=0
> dying=0 spinning=false blocked=false lockedg=-1
>   G12934084: status=2(chan receive) m=34 lockedm=-1
>
>   P2: status=1 schedtick=22743131 syscalltick=6730185 m=17 runqsize=0
> gfreecnt=37
>   M17: p=2 curg=12934032 mallocing=0 throwing=0 preemptoff= locks=0
> dying=0 spinning=false blocked=false lockedg=-1
>   G12934032: status=2(chan receive) m=17 lockedm=-1
>
>   P3: status=1 schedtick=22674653 syscalltick=6516472 m=11 runqsize=0
> gfreecnt=20
>   M11: p=3 curg=12934147 mallocing=0 throwing=0 preemptoff= locks=0
> dying=0 spinning=false blocked=false lockedg=-1
>   G12934147: status=2(chan receive) m=11 lockedm=-1
>
>   P4: status=1 schedtick=22693327 syscalltick=6322345 m=33 runqsize=0
> gfreecnt=12
>   M33: p=4 curg=12934059 mallocing=0 throwing=0 preemptoff= locks=0
> dying=0 spinning=false blocked=false lockedg=-1
>   G12934059: status=2(chan receive) m=33 lockedm=-1
>
>   P5: status=1 schedtick=22370152 syscalltick=6219903 m=0 runqsize=0
> gfreecnt=39
>   M0: p=5 curg=12934131 mallocing=0 throwing=0 preemptoff= locks=0 dying=0
> spinning=false blocked=false lockedg=-1
>   G12934131: status=2(chan receive) m=0 lockedm=-1
>
>   P6: status=1 schedtick=21910259 syscalltick=6171748 m=51 runqsize=0
> gfreecnt=39
>   M51: p=6 curg=12934126 mallocing=0 throwing=0 preemptoff= locks=0
> dying=0 spinning=false blocked=false lockedg=-1
>   G12934126: status=2(chan receive) m=51 lockedm=-1
>
>   P7: status=1 schedtick=22019793 syscalltick=6091894 m=32 runqsize=1
> gfreecnt=30
>   M32: p=7 curg=12934191 mallocing=0 throwing=0 preemptoff= locks=0
> dying=0 spinning=false blocked=false lockedg=-1
>   G12934191: status=2(chan receive) m=32 lockedm=-1
>
> After the sentinel goroutine was scheduled (~1 second after this
> schedtrace output) I captured the stacks and crossreferenced the G's. All
> of them are in time.Now()
>
>   goroutine 12934173 [runnable]:
>   time.Now(0x7f453a441dc0, 0xc0169fe000, 0x5a0)
>   /usr/local/go/src/time/time.go:1087 +0xb2
>   -- line in request handler in my app--
>
>   goroutine 12934084 [runnable]:
>   time.Now(0x7f4614a65400, 0xc0155a2000, 0x1680)
>   /usr/local/go/src/time/time.go:1087 +0xb2
>   -- line in request handler in my app--
>
>   goroutine 12934032 [runnable]:
>   time.Now(0x7f457d74ae20, 0xc023179400, 0x3c0)
>   /usr/local/go/src/time/time.go:1087 +0xb2
>   -- line in request handler in my app--
>
>   goroutine 12934147 [runnable]:
>   time.Now(0x7f461fcca4c0, 0xc00cf38400, 0x3c0)
>   /usr/local/go/src/time/time.go:1087 +0xb2
>   -- line in request handler in my app--
>
>   goroutine 12934059 [runnable]:
>   time.Now(0x7f45ff9dcd00, 0xc0245b1000, 0xf00)
>   /usr/local/go/src/time/time.go:1087 +0xb2
>   -- line in request handler in my app--
>
>   goroutine 12934131 [runnable]:
>   time.Now(0x7f4513453e80, 0xc02a96c000, 0x5a0)
>   /usr/local/go/src/time/time.go:1087 +0xb2
>   -- line in request handler in my app--
>
>   goroutine 12934126 [runnable]:
>   time.Now(0x7f457fd46e60, 0xc016889200, 0x5a0)
>   /usr/local/go/src/time/time.go:1087 +0xb2
>   -- line in request handler in my app--
>
>   goroutine 12934191 [runnable]:
>   time.Now(0x7f45088d6b80, 0xc01770ec00, 0x5a0)
>   /usr/local/go/src/time/time.go:1087 +0xb2
>   -- line in request handler in my app--
>
> I went through the path that a goroutine takes to get to that time.Now()
> line and it flows from GRPC through to some simple sanitization. There are
> function calls sprinkled everywhere and no loops that would be
> unpreemtible, so I don't think they all got stuck in some tight loop before
> the time.Now and then somehow all proceeded a few lines so that they ALL
> were on time.Now when I dumped the stacks. I think they were stuck in
> time.Now() during the scheduling stall.
>
> This is running in kubernetes on AWS, although I have seen it on a bare
> metal kubernetes cluster too.
>
> Any ideas where to progress from here?
>
>
>
> On Fri, Aug 23, 2019 at 3:32 PM Ian Lance Taylor <i...@golang.org> wrote:
>
>> On Fri, Aug 23, 2019 at 2:30 PM Michael Andersen <mich...@steelcode.com>
>> wrote:
>> >
>> > Are you suggesting that there might be enough unpreemtable goroutines
>> to fill all the P's? I do have several cgo goroutines sitting in syscalls,
>> but my understanding was that the scheduler would "preempt" that by moving
>> the M off the P and creating a new M.
>>
>> That is how it is supposed to work, yes.  That process can take up to
>> 10ms, but that doesn't explain what you are seeing.
>>
>> Ian
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/CAPLpPruoeZfEHBHrZJQPM1Ynd3u0gZXPbwBfMKb%2BnQb8RwKU-A%40mail.gmail.com.

Reply via email to