Hi! We have a Go program (an API server) that has been running stably for a long time on a virtual machine with 8 cores. Recently, however, it hit a weird problem: a single CPU reached 100% usage while the others stayed very low. At the same time, the network bandwidth dropped to zero, and there were a bunch of TCP connections in the CLOSE_WAIT state on the server side. So it seems to me that the program was busy spinning on some event and could not execute our code.
We sent a QUIT signal to it and got its goroutine stacks. There were 3000+ goroutines: only two were running, 370 were runnable, and the rest were blocked on channel operations. Unfortunately, the stacks of the two running goroutines were not available; the dump only reported "goroutine running on other thread".

We didn't adjust runtime.GOMAXPROCS, so the number of Ps should be the default, i.e. the number of processors, 8. In my view, the number of running goroutines should be larger, and the runq seems rather long (even if we have 8 Ms running user goroutines, the average runq size is about 46, i.e. 370 / 8, if we only consider the global runq).

I don't know what the other Ms were doing at that time. I know there is a mark assist mechanism in the garbage collector implementation, but would it use a lot of Ms and get the scheduler into trouble?

Go version we use: go/1.12.13. OS we use: CentOS/3.10.0.
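For anyone who wants to reproduce the counts above from a saved dump, here is a minimal sketch that tallies goroutines by the state shown in each header line of a SIGQUIT dump. The function name `countStates` and the sample dump are hypothetical and only for illustration; the sketch assumes header lines of the form `goroutine N [state, ...]:`, which is what the runtime prints.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// headerRE matches the first line of each goroutine in a dump,
// e.g. "goroutine 42 [chan receive, 5 minutes]:", and captures the state.
var headerRE = regexp.MustCompile(`^goroutine \d+ \[([^\],]+)`)

// countStates (hypothetical helper) tallies goroutines by header state.
func countStates(dump string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		if m := headerRE.FindStringSubmatch(sc.Text()); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// Made-up, shortened dump; a real one would be read from a file.
	sample := `goroutine 1 [running]:
	goroutine running on other thread; stack unavailable

goroutine 7 [runnable]:
main.handle(...)

goroutine 8 [chan receive, 5 minutes]:
main.worker(...)

goroutine 9 [runnable]:
main.handle(...)
`
	counts := countStates(sample)

	// Print states in a stable order.
	states := make([]string, 0, len(counts))
	for s := range counts {
		states = append(states, s)
	}
	sort.Strings(states)
	for _, s := range states {
		fmt.Printf("%-14s %d\n", s, counts[s])
	}
}
```

The dump only shows goroutine states, not the per-P run queues. If the process can be restarted with `GODEBUG=schedtrace=1000` (optionally with `scheddetail=1`), the runtime periodically prints scheduler state, including idle Ps, thread counts and run queue lengths, which may help answer what the other Ms were doing.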