On Thu, Feb 22, 2018 at 5:48 PM, Michael Andersen <mich...@steelcode.com> wrote:
>
> I have a complex program that when under load will very reproducibly freeze
> every goroutine simultaneously. It then makes no progress at all, even if
> left for hours. I'm posting here because I don't know of anything that can
> cause this behavior so I don't even know where to begin debugging. When it
> "freezes", every goroutine appears to be no longer scheduled, no matter how
> simple. Even this at the start of main() ceases to print to stdout:
>
>     go func() {
>         for {
>             time.Sleep(3 * time.Second)
>             fmt.Printf("Still alive\n")
>         }
>     }()
>
> The system is nowhere near OOM, the goroutine count is large but reasonable
> just before the freeze (<2k). After it freezes the process is still running,
> and attaching sysdig shows it is stuck spinning in futex, with only this
> showing up over and over:
>
>     637779 17:21:56.254826712 20 prog (43085) < futex res=-110(ETIMEDOUT)
>     637782 17:21:56.254827305 20 prog (43085) > futex addr=10D5FA0 op=0(FUTEX_WAIT) val=0
>     637783 17:21:56.254828132 20 prog (43085) > switch next=0 pgft_maj=0 pgft_min=60361 vm_size=20710168 vm_rss=10792276 vm_swap=0
>
> The "frozen" program still responds to SIGQUIT and dumps out the goroutines,
> but given that this is not a minimal reproducer (which I have not managed to
> make) I don't know which parts of that are useful. I put all of it here:
> https://gist.github.com/immesys/0b741e4ea18979614d8419fa9c007098 .
>
> My main question is what sort of bugs can cause the whole program to lock
> up? Even if some goroutines were deadlocked, why would that stop everything
> from net/http/pprof to a printf loop from working?
>
> Some tidbits:
>
> - I have a core dump so I can inspect things with delve if I know what I am
>   looking for.
> - Building/running with -race doesn't print anything.
> - I came across this
>   (https://groups.google.com/forum/#!msg/golang-nuts/PMm8nH0yaoA/mb-cnKmZlb4J)
>   which describes a similar occurrence, but I don't interact with syslog, at
>   least not directly.
> - I am getting this on go 1.10 but I rebuilt on 1.9.4 and I get the same
>   behavior.
> - I am on linux amd64 kernel 4.10.
> - It only takes about two minutes to reproduce.
> - When frozen, only a single CPU core is pegged, the rest of the system is
>   fine.

I don't know what is happening with your program.

This kind of thing can happen if you have a goroutine that is running in a tight loop with no function calls or memory allocations. The current Go scheduler is non-preemptive, meaning that nothing will stop that loop. If that loop occurs while holding a lock, it could block the entire rest of the program from running. However, you would see this in the stack trace. This problem with the current scheduler is https://golang.org/issue/10958.

This kind of thing can happen if you are using an in-process FUSE file system implemented by goroutines in your program. The Go runtime believes that some system calls, such as pipe or socket, never block. If you have somehow set things up so that those system calls enter your FUSE file system and depend on some other goroutine running, it is possible that that goroutine will never be scheduled. I don't know of any bugs like this at present but they have existed in the past. Of course, if you aren't using a FUSE file system then this is not the problem.

This kind of thing can happen if you use assembler code to do a blocking operation, or if you use syscall.RawSyscall to call a system call that blocks. That can confuse the scheduler and lead to a deadlock. Don't do that.

None of these are likely, but you asked for suggestions, and that's what I've come up with.

Ian