On Fri, Feb 23, 2018 at 1:19 AM, Michael Andersen <mich...@steelcode.com> wrote:
>
> Ok, I found the problem. A tight loop was indeed preventing the garbage collector from proceeding, and then the GC was stopping everything else. The loop contained atomic.LoadUint64 and atomic.CompareAndSwapUint64, but neither of those seems to qualify as a preemption point. Fixing that loop fixes everything.
>
> Thanks to Ian for pointing me on the right path, and thanks to Florin Pățan on gophers.slack.com for pointing me at Delve, which eventually let me find the problematic code.
>
> I know this is not at all the job of the race detector, but given that compiling with -race is the first thing we do when a program shows "deadlock"-type behavior, and I have seen several people thrown off by the way Go behaves when there are tight loops, could we detect that the GC has stalled for far too long and print something like "Goroutine #545 is preventing GC"? Just a thought.
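For illustration, here is a minimal sketch of the kind of loop being described and the kind of fix applied; the names (state, spinAcquire) are invented for this example and are not from the original program:

package main

import (
    "runtime"
    "sync/atomic"
)

var state uint64

// spinAcquire busy-waits until it flips state from 0 to 1. The atomic
// operations compile down to plain instructions rather than function
// calls, so under the cooperative scheduler this loop contains no
// preemption point and a stop-the-world request (e.g. for GC) can wait
// on it indefinitely.
func spinAcquire() {
    for {
        if atomic.LoadUint64(&state) == 0 &&
            atomic.CompareAndSwapUint64(&state, 0, 1) {
            return
        }
        // The fix: give the scheduler an explicit yield point.
        // runtime.Gosched() works; a non-inlined function call
        // generally also works, since the stack check on function
        // entry doubles as a preemption point.
        runtime.Gosched()
    }
}

func main() {
    spinAcquire()
    atomic.StoreUint64(&state, 0)
}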
That seems like a good suggestion. Want to open an issue for it at https://golang.org/issue? Thanks. There is a pretty good chance that the general problem of non-preemptible loops will be fixed in 1.11.

Ian

> On Thu, Feb 22, 2018 at 11:09 PM, Michael Andersen <mich...@steelcode.com> wrote:
>>
>> I have found that GODEBUG=schedtrace still prints when the program is frozen, but I cannot find much documentation on deciphering its output. The full output is here: https://gist.github.com/immesys/7f213c63c54ce60ba72d1cbc9ffcc4cb but here are the first few lines. It looks like this is garbage collector related?
>>
>> SCHED 111364ms: gomaxprocs=40 idleprocs=0 threads=78 spinningthreads=0 idlethreads=65 runqueue=31 gcwaiting=1 nmidlelocked=1 stopwait=1 sysmonwait=0
>> P0: status=3 schedtick=36699 syscalltick=29158 m=-1 runqsize=0 gfreecnt=45
>> P1: status=3 schedtick=37370 syscalltick=30563 m=-1 runqsize=0 gfreecnt=31
>> P2: status=3 schedtick=36650 syscalltick=29478 m=-1 runqsize=0 gfreecnt=17
>> P3: status=3 schedtick=42395 syscalltick=34028 m=-1 runqsize=0 gfreecnt=38
>> P4: status=3 schedtick=40531 syscalltick=33546 m=-1 runqsize=0 gfreecnt=27
>> P5: status=3 schedtick=41369 syscalltick=31884 m=-1 runqsize=0 gfreecnt=19
>> P6: status=3 schedtick=39302 syscalltick=32162 m=-1 runqsize=0 gfreecnt=39
>> P7: status=3 schedtick=39739 syscalltick=32624 m=-1 runqsize=0 gfreecnt=34
>> P8: status=3 schedtick=32585 syscalltick=24358 m=-1 runqsize=0 gfreecnt=45
>> P9: status=3 schedtick=40376 syscalltick=30399 m=-1 runqsize=0 gfreecnt=27
>> P10: status=3 schedtick=31018 syscalltick=22164 m=-1 runqsize=0 gfreecnt=40
>> P11: status=3 schedtick=32043 syscalltick=24479 m=-1 runqsize=0 gfreecnt=53
>> P12: status=3 schedtick=36459 syscalltick=25222 m=-1 runqsize=0 gfreecnt=21
>> P13: status=3 schedtick=38445 syscalltick=28317 m=-1 runqsize=0 gfreecnt=38
>> P14: status=3 schedtick=39533 syscalltick=28189 m=-1 runqsize=0 gfreecnt=29
>> P15: status=3 schedtick=37009 syscalltick=27231 m=-1 runqsize=0 gfreecnt=29
>> P16: status=3 schedtick=36957 syscalltick=26211 m=44 runqsize=0 gfreecnt=45
>> P17: status=3 schedtick=35632 syscalltick=26481 m=-1 runqsize=0 gfreecnt=48
>> P18: status=3 schedtick=36228 syscalltick=24751 m=-1 runqsize=0 gfreecnt=59
>> P19: status=3 schedtick=32479 syscalltick=22970 m=-1 runqsize=0 gfreecnt=48
>> P20: status=3 schedtick=34090 syscalltick=25864 m=-1 runqsize=0 gfreecnt=12
>> P21: status=3 schedtick=35409 syscalltick=24369 m=-1 runqsize=0 gfreecnt=6
>> P22: status=3 schedtick=42235 syscalltick=28607 m=-1 runqsize=0 gfreecnt=16
>> P23: status=3 schedtick=29733 syscalltick=19429 m=-1 runqsize=0 gfreecnt=43
>> P24: status=3 schedtick=34757 syscalltick=22141 m=-1 runqsize=0 gfreecnt=26
>> P25: status=3 schedtick=31434 syscalltick=21767 m=-1 runqsize=0 gfreecnt=23
>> P26: status=3 schedtick=31597 syscalltick=21657 m=-1 runqsize=0 gfreecnt=53
>> P27: status=3 schedtick=32539 syscalltick=22653 m=-1 runqsize=0 gfreecnt=34
>> P28: status=3 schedtick=29503 syscalltick=20588 m=-1 runqsize=0 gfreecnt=28
>> P29: status=3 schedtick=30842 syscalltick=20199 m=-1 runqsize=0 gfreecnt=41
>> P30: status=3 schedtick=29787 syscalltick=19492 m=-1 runqsize=0 gfreecnt=15
>> P31: status=3 schedtick=28226 syscalltick=19101 m=-1 runqsize=0 gfreecnt=32
>> P32: status=3 schedtick=34254 syscalltick=24135 m=-1 runqsize=0 gfreecnt=27
>> P33: status=3 schedtick=34567 syscalltick=22554 m=-1 runqsize=0 gfreecnt=36
>> P34: status=3 schedtick=30416 syscalltick=19722 m=-1 runqsize=0 gfreecnt=33
>> P35: status=1 schedtick=19536 syscalltick=13084 m=0 runqsize=0 gfreecnt=50
>> P36: status=3 schedtick=29336 syscalltick=19015 m=-1 runqsize=0 gfreecnt=28
>> P37: status=3 schedtick=29794 syscalltick=19499 m=-1 runqsize=0 gfreecnt=40
>> P38: status=3 schedtick=31748 syscalltick=20359 m=-1 runqsize=0 gfreecnt=35
>> P39: status=3 schedtick=31851 syscalltick=20260 m=-1 runqsize=0 gfreecnt=46
>>
>> On Thu, Feb 22, 2018 at 8:47 PM, Michael Andersen <mich...@steelcode.com> wrote:
>>>
>>> Hi
>>>
>>> Thanks for your suggestions. This freeze happens to coincide with a larger number of cgo calls, which in turn most likely do blocking reads/writes from sockets. How are those treated by the scheduler?
>>>
>>> I am not doing anything FUSE related, nor do I have assembly code. The only "interesting" stuff is the use of a C library, librados.
>>>
>>> In the goroutine dump from SIGQUIT, some stacks are followed by register dumps. Are those the only ones currently running? All of those are in the runtime: 16 are in runtime.futex, 2 are in runtime.notetsleepg, and one is in runtime.gopark, which would make it seem like it was not a non-preemptible tight loop, but I am not sure how to parse the SIGQUIT output.
>>>
>>> Thanks
>>> Michael
>>>
>>>> I don't know what is happening with your program.
>>>>
>>>> This kind of thing can happen if you have a goroutine that is running in a tight loop with no function calls or memory allocations. The current Go scheduler is non-preemptive, meaning that nothing will stop that loop. If that loop occurs while holding a lock, it could block the entire rest of the program from running. However, you would see this in the stack trace. This problem with the current scheduler is https://golang.org/issue/10958.
>>>>
>>>> This kind of thing can happen if you are using an in-process FUSE file system implemented by goroutines in your program. The Go runtime believes that some system calls, such as pipe or socket, never block. If you have somehow set things up so that those system calls enter your FUSE file system and depend on some other goroutine running, it is possible that that goroutine will never be scheduled. I don't know of any bugs like this at present, but they have existed in the past. Of course, if you aren't using a FUSE file system then this is not the problem.
>>>>
>>>> This kind of thing can happen if you use assembler code to do a blocking operation, or if you use syscall.RawSyscall to call a system call that blocks. That can confuse the scheduler and lead to a deadlock. Don't do that.
>>>>
>>>> None of these are likely, but you asked for suggestions, and that's what I've come up with.
>>>>
>>>> Ian
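For anyone trying to decipher a trace like the one above: a rough key, assuming it was produced with the detailed trace enabled and using the P state constants from the Go runtime's proc.go ("./yourprogram" below is a placeholder):

# Print detailed scheduler state every 1000 ms.
GODEBUG=schedtrace=1000,scheddetail=1 ./yourprogram

# gcwaiting=1  a stop-the-world (usually for GC) has been requested and the
#              scheduler is waiting for all Ps to stop
# stopwait=1   number of Ps that have not yet stopped
# status       per-P state: 0=_Pidle, 1=_Prunning, 2=_Psyscall, 3=_Pgcstop, 4=_Pdead

Read that way, the trace is consistent with the diagnosis in this thread: every P except P35 reports status=3 (_Pgcstop, already parked for the stop-the-world), while P35 reports status=1 (_Prunning) and stopwait=1 shows one P still running, presumably the one executing the non-preemptible loop.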