On Fri, Feb 23, 2018 at 1:19 AM, Michael Andersen <mich...@steelcode.com> wrote:
>
> OK, I found the problem. A tight loop was indeed preventing the garbage
> collector from proceeding, and the GC was then stopping everything else. The
> loop contained atomic.LoadUint64 and atomic.CompareAndSwapUint64, but
> neither of those seems to qualify as a preemption point. Fixing that loop
> fixes everything.
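
One way to make such a loop preemptible, as a minimal sketch; the function and
names here are hypothetical, since the thread doesn't show the actual code or fix:

    import (
        "runtime"
        "sync/atomic"
    )

    // busyWait spins until *seq reaches target. With only atomic operations in
    // the body there is no preemption point, so a loop like this can stall a
    // stop-the-world GC; the explicit runtime.Gosched yield fixes that.
    func busyWait(seq *uint64, target uint64) {
        for atomic.LoadUint64(seq) < target {
            runtime.Gosched() // yield so the scheduler (and the GC) can make progress
        }
    }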
>
> Thanks to Ian for setting me on the right path, and thanks to Florin Pățan
> on gophers.slack.com for pointing me at Delve, which eventually let me find
> the problematic code.
>
> I know this is not at all the job of the race detector, but given that
> compiling with -race is the first thing we do when a program shows
> "deadlock"-like behavior, and I have seen several people thrown off by the
> way Go behaves when there are tight loops, could we detect that the GC has
> stalled for far too long and print something out? Something like "Goroutine
> #545 is preventing GC"? Just a thought.

That seems like a good suggestion.  Want to open an issue for it at
https://golang.org/issue?  Thanks.

There is a pretty good chance that the general non-preemptible loop
problem will be fixed in Go 1.11.

Ian




> On Thu, Feb 22, 2018 at 11:09 PM, Michael Andersen <mich...@steelcode.com>
> wrote:
>>
>> I have found that GODEBUG=schedtrace still prints when the program is
>> frozen, but I cannot find much documentation on deciphering its output. The
>> full output is here:
>> https://gist.github.com/immesys/7f213c63c54ce60ba72d1cbc9ffcc4cb but here
>> are the first few lines. It looks like this is garbage collector related?
>>
>> SCHED 111364ms: gomaxprocs=40 idleprocs=0 threads=78 spinningthreads=0 idlethreads=65 runqueue=31 gcwaiting=1 nmidlelocked=1 stopwait=1 sysmonwait=0
>> P0: status=3 schedtick=36699 syscalltick=29158 m=-1 runqsize=0 gfreecnt=45
>> P1: status=3 schedtick=37370 syscalltick=30563 m=-1 runqsize=0 gfreecnt=31
>> P2: status=3 schedtick=36650 syscalltick=29478 m=-1 runqsize=0 gfreecnt=17
>> P3: status=3 schedtick=42395 syscalltick=34028 m=-1 runqsize=0 gfreecnt=38
>> P4: status=3 schedtick=40531 syscalltick=33546 m=-1 runqsize=0 gfreecnt=27
>> P5: status=3 schedtick=41369 syscalltick=31884 m=-1 runqsize=0 gfreecnt=19
>> P6: status=3 schedtick=39302 syscalltick=32162 m=-1 runqsize=0 gfreecnt=39
>> P7: status=3 schedtick=39739 syscalltick=32624 m=-1 runqsize=0 gfreecnt=34
>> P8: status=3 schedtick=32585 syscalltick=24358 m=-1 runqsize=0 gfreecnt=45
>> P9: status=3 schedtick=40376 syscalltick=30399 m=-1 runqsize=0 gfreecnt=27
>> P10: status=3 schedtick=31018 syscalltick=22164 m=-1 runqsize=0 gfreecnt=40
>> P11: status=3 schedtick=32043 syscalltick=24479 m=-1 runqsize=0 gfreecnt=53
>> P12: status=3 schedtick=36459 syscalltick=25222 m=-1 runqsize=0 gfreecnt=21
>> P13: status=3 schedtick=38445 syscalltick=28317 m=-1 runqsize=0 gfreecnt=38
>> P14: status=3 schedtick=39533 syscalltick=28189 m=-1 runqsize=0 gfreecnt=29
>> P15: status=3 schedtick=37009 syscalltick=27231 m=-1 runqsize=0 gfreecnt=29
>> P16: status=3 schedtick=36957 syscalltick=26211 m=44 runqsize=0 gfreecnt=45
>> P17: status=3 schedtick=35632 syscalltick=26481 m=-1 runqsize=0 gfreecnt=48
>> P18: status=3 schedtick=36228 syscalltick=24751 m=-1 runqsize=0 gfreecnt=59
>> P19: status=3 schedtick=32479 syscalltick=22970 m=-1 runqsize=0 gfreecnt=48
>> P20: status=3 schedtick=34090 syscalltick=25864 m=-1 runqsize=0 gfreecnt=12
>> P21: status=3 schedtick=35409 syscalltick=24369 m=-1 runqsize=0 gfreecnt=6
>> P22: status=3 schedtick=42235 syscalltick=28607 m=-1 runqsize=0 gfreecnt=16
>> P23: status=3 schedtick=29733 syscalltick=19429 m=-1 runqsize=0 gfreecnt=43
>> P24: status=3 schedtick=34757 syscalltick=22141 m=-1 runqsize=0 gfreecnt=26
>> P25: status=3 schedtick=31434 syscalltick=21767 m=-1 runqsize=0 gfreecnt=23
>> P26: status=3 schedtick=31597 syscalltick=21657 m=-1 runqsize=0 gfreecnt=53
>> P27: status=3 schedtick=32539 syscalltick=22653 m=-1 runqsize=0 gfreecnt=34
>> P28: status=3 schedtick=29503 syscalltick=20588 m=-1 runqsize=0 gfreecnt=28
>> P29: status=3 schedtick=30842 syscalltick=20199 m=-1 runqsize=0 gfreecnt=41
>> P30: status=3 schedtick=29787 syscalltick=19492 m=-1 runqsize=0 gfreecnt=15
>> P31: status=3 schedtick=28226 syscalltick=19101 m=-1 runqsize=0 gfreecnt=32
>> P32: status=3 schedtick=34254 syscalltick=24135 m=-1 runqsize=0 gfreecnt=27
>> P33: status=3 schedtick=34567 syscalltick=22554 m=-1 runqsize=0 gfreecnt=36
>> P34: status=3 schedtick=30416 syscalltick=19722 m=-1 runqsize=0 gfreecnt=33
>> P35: status=1 schedtick=19536 syscalltick=13084 m=0 runqsize=0 gfreecnt=50
>> P36: status=3 schedtick=29336 syscalltick=19015 m=-1 runqsize=0 gfreecnt=28
>> P37: status=3 schedtick=29794 syscalltick=19499 m=-1 runqsize=0 gfreecnt=40
>> P38: status=3 schedtick=31748 syscalltick=20359 m=-1 runqsize=0 gfreecnt=35
>> P39: status=3 schedtick=31851 syscalltick=20260 m=-1 runqsize=0 gfreecnt=46
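
Reading the key fields above: gcwaiting=1 and stopwait=1 mean the garbage
collector has requested a stop-the-world and is still waiting for one P to
stop; status=3 is _Pgcstop (already stopped for GC), and the single P in
status=1 (_Prunning, P35 here) is the one the GC is waiting on. A minimal
sketch, hypothetical and not code from this thread, that produces the same
signature on a multi-core machine when run with
GODEBUG=schedtrace=1000,scheddetail=1:

    package main

    import "sync/atomic"

    var sink []byte // package-level sink so the allocations below are not optimized away

    // The spinning goroutine has no preemption point, so once the allocating
    // loop triggers a GC the stop-the-world never completes, and schedtrace
    // keeps printing gcwaiting=1 stopwait=1 with all but one P in status=3.
    func main() {
        var flag uint64
        go func() {
            for atomic.LoadUint64(&flag) == 0 {
                // tight loop: only an atomic load, nothing the runtime can preempt at
            }
        }()
        for {
            sink = make([]byte, 1<<20) // allocate until the GC starts
        }
    }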
>>
>>
>>
>> On Thu, Feb 22, 2018 at 8:47 PM, Michael Andersen <mich...@steelcode.com>
>> wrote:
>>>
>>> Hi
>>>
>>> Thanks for your suggestions. This freeze happens to coincide with a
>>> larger number of cgo calls, which in turn most likely do blocking reads and
>>> writes on sockets. How are those treated by the scheduler?
>>>
>>> I am not doing anything FUSE related, nor do I have assembly code. The
>>> only "interesting" stuff is the use of a c library - librados.
>>>
>>> In the goroutine dump from SIGQUIT, some stacks are followed by register
>>> dumps. Are those the only goroutines currently running? All of those are in
>>> the runtime: 16 are in runtime.futex, two are in runtime.notetsleepg, and
>>> one is in runtime.gopark, which would make it seem like it was not a
>>> non-preemptible tight loop, but I am not sure how to parse the SIGQUIT
>>> output.
>>>
>>> Thanks
>>> Michael
>>>
>>>>
>>>> I don't know what is happening with your program.
>>>>
>>>> This kind of thing can happen if you have a goroutine that is running
>>>> in a tight loop with no function calls or memory allocations.  The
>>>> current Go scheduler is non-preemptive, meaning that nothing will stop
>>>> that loop.  If that loop occurs while holding a lock, it could block
>>>> the entire rest of the program from running.  However, you would see
>>>> this in the stack trace.  This problem with the current scheduler is
>>>> https://golang.org/issue/10958.
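
For illustration, a hedged sketch of the "tight loop while holding a lock"
case; the code is entirely hypothetical. Any other goroutine calling mu.Lock
blocks until spin returns, and because the loop body makes no calls and
allocates nothing, the runtime has no point at which to preempt it:

    import (
        "sync"
        "sync/atomic"
    )

    var mu sync.Mutex

    // spin holds mu while busy-waiting on done. Nothing in the loop body is a
    // preemption point, so goroutines waiting on mu (and a pending GC
    // stop-the-world) can be blocked indefinitely.
    func spin(done *uint64) {
        mu.Lock()
        defer mu.Unlock()
        for atomic.LoadUint64(done) == 0 {
            // no calls, no allocation: the scheduler cannot preempt here
        }
    }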
>>>>
>>>> This kind of thing can happen if you are using an in-process FUSE file
>>>> system implemented by goroutines in your program.  The Go runtime
>>>> believes that some system calls, such as pipe or socket, never block.
>>>> If you have somehow set things up so that those system calls enter
>>>> your FUSE file system and depend on some other goroutine running, it
>>>> is possible that that goroutine will never be scheduled.  I don't know
>>>> of any bugs like this at present but they have existed in the past.
>>>> Of course if you aren't using a FUSE file system then this is not the
>>>> problem.
>>>>
>>>> This kind of thing can happen if you use assembler code to do a
>>>> blocking operation, or if you use syscall.RawSyscall to call a system
>>>> call that blocks.  That can confuse the scheduler and lead to a
>>>> deadlock.  Don't do that.
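
A hedged, Linux-specific sketch of the difference; readFD and fd are made up
for illustration. syscall.Syscall tells the runtime the thread may block, so
the scheduler can hand the P to another thread, while syscall.RawSyscall skips
that bookkeeping:

    import (
        "syscall"
        "unsafe"
    )

    // readFD reads from fd into buf (buf must be non-empty). syscall.Syscall
    // brackets the call with entersyscall/exitsyscall so the runtime knows the
    // thread may block. Using syscall.RawSyscall here instead would leave the
    // runtime thinking this thread is still running Go code, which can wedge
    // the scheduler if the read blocks.
    func readFD(fd int, buf []byte) (int, error) {
        n, _, errno := syscall.Syscall(syscall.SYS_READ, uintptr(fd),
            uintptr(unsafe.Pointer(&buf[0])), uintptr(len(buf)))
        if errno != 0 {
            return int(n), errno
        }
        return int(n), nil
    }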
>>>>
>>>> None of these are likely, but you asked for suggestions, and that's
>>>> what I've come up with.
>>>>
>>>> Ian
>>>
>>>
>>
>
