Typo: from the backtrace we have from the sigabort, we see only *2* such data collector go routines blocked and the other two 2-minute go routines waiting on the waitgroup.
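For context, here is a minimal sketch of the collection pattern described below (the names are illustrative, not our actual code): every 2 minutes a coordinator go routine fans out the data collector go routines and waits on a sync.WaitGroup, with a per-cycle context deadline so the underlying big query / elastic calls return instead of blocking wg.Wait() indefinitely.

package main

import (
	"context"
	"log"
	"sync"
	"time"
)

// collectOnce fans out one collector per source and waits for all of them.
// The context deadline is what keeps a hung query from blocking wg.Wait()
// forever, as long as fetch honors ctx.
func collectOnce(sources []string, timeout time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var wg sync.WaitGroup
	for _, src := range sources {
		wg.Add(1)
		go func(src string) {
			defer wg.Done()
			if err := fetch(ctx, src); err != nil {
				log.Printf("collector %s: %v", src, err)
			}
		}(src)
	}
	wg.Wait()
}

// fetch stands in for the big query / elastic call; the real client call
// should take the same ctx so it is cancelled when the deadline expires.
func fetch(ctx context.Context, src string) error {
	select {
	case <-time.After(500 * time.Millisecond): // simulated query
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ticker := time.NewTicker(2 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		collectOnce([]string{"bigquery", "elastic"}, 90*time.Second)
	}
}

The backtrace below shows the coordinators blocked in sync.(*WaitGroup).Wait while the collectors sit in net/http, which is the failure mode a deadline like this is meant to bound.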
On Wed, Aug 26, 2020 at 8:50 PM Siddhesh Divekar <siddhesh.dive...@gmail.com> wrote: > Right, then it looks less likely that we are blocked on a mutex. > > Every 2 minutes we spin up a go routine which then in turn spins up a > bunch of go routines to collect data from big query & elastic (data > collector routines). > The 2-minute go routine then waits on a waitgroup for the data > collector routines to come back. So it's not a single go routine on our side, > at least. > From the backtrace we have from the sigabort we see only one such data collector > go routine blocked and the other two 2-minute threads waiting on the waitgroup. > > I see similar issues on the internet. Kubernetes hit the following issue, > which has exactly the same signature as ours. > > https://github.com/kubernetes/kubernetes/issues/53191 > > I will follow up with the big query guys to see what their behavior is. If one > of the requests gets stuck on big query, what do they do with the > other requests? > But in that case we should see all the other requests in our backtrace > waiting in their code, which is not the case. > > > > On Wed, Aug 26, 2020 at 7:44 PM Robert Engels <reng...@ix.netcom.com> > wrote: > >> If you look at the stack trace the futex is because it is trying to >> shut down the entire process - killing all of the M’s - probably because you >> sent the sigabort. >> >> On Aug 26, 2020, at 9:41 PM, Robert Engels <reng...@ix.netcom.com> wrote: >> >> >> The big query client may be “single threaded”, meaning you may need a pool >> of connections. Not sure - haven’t used the BQ library from Go. >> >> If that’s the case a single slow query will block all of your connections >> (if they are all accessing BQ). Plus there are BQ rate limits - when you >> hit those the connector may block - again blocking all clients. >> >> On Aug 26, 2020, at 8:27 PM, Siddhesh Divekar <siddhesh.dive...@gmail.com> >> wrote: >> >> >> Robert, >> >> That's where the confusion is. >> From the backtrace we see that two go routines are blocked on the >> waitgroup. >> The go routine on which they are blocked is waiting on big query to >> return. >> >> On every user request we create a new go routine, so they shouldn't get >> blocked because of these two. >> >> However, we do see one go routine blocked on a futex. Is there a way to >> know where it originated from? >> When does golang use futex under the hood, so that we can investigate from >> that angle? >> >> goroutine 0 [idle]: >> runtime.futex(0x5b607a8, 0x80, 0x0, 0x0, 0x0, 0x7ffc5dd1bd20, 0x437640, >> 0xc003638038, 0x7ffc5dd1bd48, 0x40c78f, ...) >> /usr/local/go/src/runtime/sys_linux_amd64.s:535 +0x21 >> runtime.futexsleep(0x5b607a8, 0x0, 0xffffffffffffffff) >> /usr/local/go/src/runtime/os_linux.go:44 +0x46 >> runtime.notesleep(0x5b607a8) >> /usr/local/go/src/runtime/lock_futex.go:151 +0x9f >> runtime.stopm() >> /usr/local/go/src/runtime/proc.go:1928 +0xc0 >> runtime.exitsyscall0(0xc003638000) >> /usr/local/go/src/runtime/proc.go:3140 +0x111 >> runtime.mcall(0x0) >> /usr/local/go/src/runtime/asm_amd64.s:318 +0x5b >> >> On Wed, Aug 26, 2020 at 5:02 PM robert engels <reng...@ix.netcom.com> >> wrote: >> >>> That should allow your server to clean up “dead” clients. Typically you >>> use this in conjunction with a ‘keep alive’ in the protocol. >>> >>> I am doubtful that a bunch of dead clients hanging around would cause a >>> CPU spike. You really don’t have too many Go routines/connections involved >>> (I’ve worked with 1000’s of live connections). 
>>> >>> I would look at something else… I am guessing your synchronization is >>> incorrect and your threads are blocking, and you have a few that are >>> spinning expecting something to happen that never will (because the source >>> of the event is blocked on the mutex/lock). >>> >>> >>> >>> On Aug 26, 2020, at 10:09 AM, Siddhesh Divekar < >>> siddhesh.dive...@gmail.com> wrote: >>> >>> Robert, >>> >>> I assume we can safely add these timeouts based on what we expect >>> should be a reasonable timeout for our clients ? >>> >>> s.ReadTimeout = expTimeOut * time.Second >>> s.WriteTimeout = expTimeOut * time.Second >>> >>> On Tue, Aug 25, 2020 at 1:14 PM Siddhesh Divekar < >>> siddhesh.dive...@gmail.com> wrote: >>> >>>> Both servers and data sources are in the cloud. >>>> I would not say a lot of data, it's precomputed data which shouldn't >>>> take that long. >>>> >>>> >>>> On Tue, Aug 25, 2020 at 11:25 AM Robert Engels <reng...@ix.netcom.com> >>>> wrote: >>>> >>>>> Are you transferring a lot of data? Are the servers non-cloud hosted? >>>>> >>>>> You could be encountering “tcp stalls”. >>>>> >>>>> On Aug 25, 2020, at 9:24 AM, Siddhesh Divekar < >>>>> siddhesh.dive...@gmail.com> wrote: >>>>> >>>>> >>>>> Clients are over the internet. >>>>> >>>>> On Tue, Aug 25, 2020 at 3:25 AM Robert Engels <reng...@ix.netcom.com> >>>>> wrote: >>>>> >>>>>> The tcp protocol allows the connection to wait for hours. Go routines >>>>>> stuck in wait do not burn CPU. Are the clients local or remote (over >>>>>> internet)? >>>>>> >>>>>> On Aug 24, 2020, at 10:29 PM, Siddhesh Divekar < >>>>>> siddhesh.dive...@gmail.com> wrote: >>>>>> >>>>>> >>>>>> Robert, >>>>>> >>>>>> We will do the profiling next time we hit the issue again & see what >>>>>> is happening. >>>>>> This was the first time we saw the issue & don't want to get rid of >>>>>> http2 advantages without making sure it's the actual culprit. >>>>>> >>>>>> Do you think in the meanwhile we should do what the discussion below >>>>>> suggests anyways ? >>>>>> >>>>>> https://stackoverflow.com/questions/42238695/goroutine-in-io-wait-state-for-long-time >>>>>> >>>>>> >>>>>> On Mon, Aug 24, 2020 at 5:37 PM Robert Engels <reng...@ix.netcom.com> >>>>>> wrote: >>>>>> >>>>>>> I think it is too hard to tell with the limited information. It >>>>>>> could be exhausted connections or it could be thrashing (the claim of >>>>>>> high >>>>>>> cpu) >>>>>>> >>>>>>> I think you want to run profiling capture prior to hitting the stick >>>>>>> state - you should be able to detect what is happening. >>>>>>> >>>>>>> If the referenced issue is related I would assume you should be able >>>>>>> to connect by forcing http/1. >>>>>>> >>>>>>> You can also try disabling http/2 and see if your issue goes away. >>>>>>> >>>>>>> On Aug 24, 2020, at 6:15 PM, Siddhesh Divekar < >>>>>>> siddhesh.dive...@gmail.com> wrote: >>>>>>> >>>>>>> Hi Robert, >>>>>>> >>>>>>> Sorry I missed your earlier response. >>>>>>> >>>>>>> From what we saw our UI was blocked and since everything was >>>>>>> unresponsive >>>>>>> we had to recover the system by sending sig abort. >>>>>>> >>>>>>> On Mon, Aug 24, 2020 at 4:11 PM Siddhesh Divekar < >>>>>>> siddhesh.dive...@gmail.com> wrote: >>>>>>> >>>>>>>> Looking at the no. of go routines we have does this apply to this >>>>>>>> issue ? >>>>>>>> https://github.com/golang/go/issues/27044 >>>>>>>> >>>>>>>> On Mon, Aug 24, 2020 at 12:54 PM Robert Engels < >>>>>>>> reng...@ix.netcom.com> wrote: >>>>>>>> >>>>>>>>> Go routines in a waiting state should not be consuming CPU. 
Are >>>>>>>>> you certain they are not in constant transition from waiting to >>>>>>>>> processing >>>>>>>>> - this could show up as high CPU usage while everything looks blocks. >>>>>>>>> >>>>>>>>> I would use pprof - github.com/robaho/goanalyzer might be of >>>>>>>>> assistance here to see the actual work being done. >>>>>>>>> >>>>>>>>> On Aug 24, 2020, at 9:10 AM, Siddhesh Divekar < >>>>>>>>> siddhesh.dive...@gmail.com> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> Hi Ian, >>>>>>>>> >>>>>>>>> Thanks for replying. >>>>>>>>> >>>>>>>>> We have a go server running which handles user requests & collects >>>>>>>>> data from various sources like gcp's cloud sql and big query. >>>>>>>>> We are using shopify's sarama library to do kafka operations. >>>>>>>>> >>>>>>>>> There are seeing lots of go routines in waiting state for several >>>>>>>>> minutes. >>>>>>>>> Over the period of time around 587 goroutines have been spun up. >>>>>>>>> >>>>>>>>> We see that two go routines are stuck on gcp big query and we are >>>>>>>>> using wait groups there. >>>>>>>>> However, it's not clear why that would cause all other go routines >>>>>>>>> to get hung & make cpu go high. >>>>>>>>> >>>>>>>>> goroutine 3332131 [semacquire, 79 minutes]: >>>>>>>>> sync.runtime_Semacquire(0xc001c4fcf8) >>>>>>>>> /usr/local/go/src/runtime/sema.go:56 +0x42 >>>>>>>>> sync.(*WaitGroup).Wait(0xc001c4fcf0) >>>>>>>>> /usr/local/go/src/sync/waitgroup.go:130 +0x64 >>>>>>>>> >>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).runParallelQuery(0xc001b54d40, >>>>>>>>> 0xc002912c00, 0x330e1b0, 0xf, 0xc002912cf0, 0x3) >>>>>>>>> /builds/fusionio/fusion/controller/stats/ins_mgr.go:488 >>>>>>>>> +0x1d7 >>>>>>>>> >>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).GetMainUi(0xc001b54d40, >>>>>>>>> 0xc002912db8, 0xc001870e68, 0x746121, 0xc0010fcaf8, 0x17) >>>>>>>>> /builds/fusionio/fusion/controller/stats/ins_mgr.go:567 >>>>>>>>> +0xa0d >>>>>>>>> >>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).fetchMainUiTeamInterval(0xc001b56780, >>>>>>>>> 0xc002356810, 0x24, 0x32f7b78, 0x5) >>>>>>>>> /builds/fusionio/fusion/controller/stats/prefetcher.go:77 >>>>>>>>> +0xf2 >>>>>>>>> created by >>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).prefetchStats >>>>>>>>> /builds/fusionio/fusion/controller/stats/prefetcher.go:100 >>>>>>>>> +0xd8 >>>>>>>>> >>>>>>>>> >>>>>>>>> goroutine 3332149 [semacquire, 79 minutes]: >>>>>>>>> sync.runtime_Semacquire(0xc0015ede48) >>>>>>>>> /usr/local/go/src/runtime/sema.go:56 +0x42 >>>>>>>>> sync.(*WaitGroup).Wait(0xc0015ede40) >>>>>>>>> /usr/local/go/src/sync/waitgroup.go:130 +0x64 >>>>>>>>> >>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).runParallelQuery(0xc001b54d40, >>>>>>>>> 0xc00249dc00, 0x330e1b0, 0xf, 0xc00249dcf0, 0x3) >>>>>>>>> /builds/fusionio/fusion/controller/stats/ins_mgr.go:488 >>>>>>>>> +0x1d7 >>>>>>>>> >>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).GetMainUi(0xc001b54d40, >>>>>>>>> 0xc00249ddb8, 0xc003200668, 0xc00407a520, 0xc003200590, 0x46ee97) >>>>>>>>> /builds/fusionio/fusion/controller/stats/ins_mgr.go:567 >>>>>>>>> +0xa0d >>>>>>>>> >>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).fetchMainUiTeamInterval(0xc001b56780, >>>>>>>>> 0xc002356ba0, 0x24, 0x32f7b78, 0x5) >>>>>>>>> /builds/fusionio/fusion/controller/stats/prefetcher.go:77 >>>>>>>>> +0xf2 >>>>>>>>> created by >>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).prefetchStats 
>>>>>>>>> /builds/fusionio/fusion/controller/stats/prefetcher.go:100 >>>>>>>>> +0xd8 >>>>>>>>> >>>>>>>>> I found the link below which kind of co-relates to our scenario. >>>>>>>>> >>>>>>>>> https://stackoverflow.com/questions/42238695/goroutine-in-io-wait-state-for-long-time >>>>>>>>> >>>>>>>>> Most of the go routines in the backtrace are in a net/http package >>>>>>>>> so our suspicion is that above bug in our code might be causing that. >>>>>>>>> Even the bigquery is getting hung in net/http. >>>>>>>>> >>>>>>>>> We are using go version 1.13.8 & are running on gcp kubernetes >>>>>>>>> cluster on ubuntu 18.04 docker. >>>>>>>>> >>>>>>>>> go env >>>>>>>>> GO111MODULE="" >>>>>>>>> GOARCH="amd64" >>>>>>>>> GOBIN="" >>>>>>>>> GOCACHE="/root/.cache/go-build" >>>>>>>>> GOENV="/root/.config/go/env" >>>>>>>>> GOEXE="" >>>>>>>>> GOFLAGS="" >>>>>>>>> GOHOSTARCH="amd64" >>>>>>>>> GOHOSTOS="linux" >>>>>>>>> GONOPROXY="" >>>>>>>>> GONOSUMDB="" >>>>>>>>> GOOS="linux" >>>>>>>>> GOPATH="/root/go" >>>>>>>>> GOPRIVATE="" >>>>>>>>> GOPROXY="https://proxy.golang.org,direct" >>>>>>>>> GOROOT="/usr/local/go" >>>>>>>>> GOSUMDB="sum.golang.org" >>>>>>>>> GOTMPDIR="" >>>>>>>>> GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64" >>>>>>>>> GCCGO="gccgo" >>>>>>>>> AR="ar" >>>>>>>>> CC="gcc" >>>>>>>>> CXX="g++" >>>>>>>>> CGO_ENABLED="1" >>>>>>>>> GOMOD="/builds/prosimoio/prosimo/pdash/go.mod" >>>>>>>>> CGO_CFLAGS="-g -O2" >>>>>>>>> CGO_CPPFLAGS="" >>>>>>>>> CGO_CXXFLAGS="-g -O2" >>>>>>>>> CGO_FFLAGS="-g -O2" >>>>>>>>> CGO_LDFLAGS="-g -O2" >>>>>>>>> PKG_CONFIG="pkg-config" >>>>>>>>> GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 >>>>>>>>> -fdebug-prefix-map=/tmp/go-build048009048=/tmp/go-build >>>>>>>>> -gno-record-gcc-switches" >>>>>>>>> >>>>>>>>> Let me know if any other information is needed. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Sat, Aug 22, 2020 at 12:30 PM Ian Lance Taylor <i...@golang.org> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> On Sat, Aug 22, 2020 at 12:06 PM Siddhesh Divekar >>>>>>>>>> <siddhesh.dive...@gmail.com> wrote: >>>>>>>>>> > >>>>>>>>>> > We saw an issue with our process running in k8s on ubuntu 18.04. >>>>>>>>>> > Most of the go routines are stuck for several minutes in >>>>>>>>>> http/http2 net code. >>>>>>>>>> > >>>>>>>>>> > Have you seen similar issues ? >>>>>>>>>> > >>>>>>>>>> > goroutine 2800143 [select, 324 minutes]: >>>>>>>>>> net/http.(*persistConn).readLoop(0xc00187d440) >>>>>>>>>> /usr/local/go/src/net/http/transport.go:2032 +0x999 created by >>>>>>>>>> net/http.(*Transport).dialConn >>>>>>>>>> /usr/local/go/src/net/http/transport.go:1580 >>>>>>>>>> +0xb0d goroutine 2738894 [IO wait, 352 minutes]: >>>>>>>>>> internal/poll.runtime_pollWait(0x7f5b61b280c0, 0x72, >>>>>>>>>> 0xffffffffffffffff) >>>>>>>>>> /usr/local/go/src/runtime/netpoll.go:184 +0x55 >>>>>>>>>> internal/poll.(*pollDesc).wait(0xc0017e7e18, 0x72, 0x1000, 0x1000, >>>>>>>>>> 0xffffffffffffffff) >>>>>>>>>> /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 >>>>>>>>>> +0x45 internal/poll.(*pollDesc).waitRead(...) 
>>>>>>>>>> /usr/local/go/src/internal/poll/fd_poll_runtime.go:92 >>>>>>>>>> internal/poll.(*FD).Read(0xc0017e7e00, 0xc0044a9000, 0x1000, 0x1000, >>>>>>>>>> 0x0, >>>>>>>>>> 0x0, 0x0) /usr/local/go/src/internal/poll/fd_unix.go:169 +0x1cf >>>>>>>>>> net.(*netFD).Read(0xc0017e7e00, 0xc0044a9000, 0x1000, 0x1000, >>>>>>>>>> 0xc0026359e8, >>>>>>>>>> 0x49d7fd, 0xc0017e7e00) /usr/local/go/src/net/fd_unix.go:202 +0x4f >>>>>>>>>> net.(*conn).Read(0xc0000db8b8, 0xc0044a9000, 0x1000, 0x1000, 0x0, >>>>>>>>>> 0x0, 0x0) >>>>>>>>>> /usr/local/go/src/net/net.go:184 +0x68 >>>>>>>>>> net/http.(*connReader).Read(0xc004a4fec0, 0xc0044a9000, 0x1000, >>>>>>>>>> 0x1000, >>>>>>>>>> 0x0, 0x0, 0x0) /usr/local/go/src/net/http/server.go:785 +0xf4 >>>>>>>>>> bufio.(*Reader).fill(0xc003f1a360) >>>>>>>>>> /usr/local/go/src/bufio/bufio.go:100 >>>>>>>>>> +0x103 bufio.(*Reader).Peek(0xc003f1a360, 0x4, 0x0, 0x0, 0x0, 0x0, >>>>>>>>>> 0xc002635ad0) /usr/local/go/src/bufio/bufio.go:138 +0x4f >>>>>>>>>> net/http.(*conn).readRequest(0xc0028e1d60, 0x393ed20, 0xc0024e9780, >>>>>>>>>> 0x0, >>>>>>>>>> 0x0, 0x0) /usr/local/go/src/net/http/server.go:962 +0xb3b >>>>>>>>>> net/http.(*conn).serve(0xc0028e1d60, 0x393ed20, 0xc0024e9780) >>>>>>>>>> /usr/local/go/src/net/http/server.go:1817 +0x6d4 created by >>>>>>>>>> net/http.(*Server).Serve /usr/local/go/src/net/http/server.go:2928 >>>>>>>>>> +0x384 >>>>>>>>>> > >>>>>>>>>> > Is there a know issue or something obvious from the backtrace >>>>>>>>>> here. >>>>>>>>>> >>>>>>>>>> It's entirely normal for goroutines to sit in pollWait if they are >>>>>>>>>> waiting for network I/O. There may be reasons why this is >>>>>>>>>> incorrect >>>>>>>>>> for your program, but you'll have to tell us those reasons. >>>>>>>>>> >>>>>>>>>> Also, along with those reasons, please tell us the version of Go >>>>>>>>>> and >>>>>>>>>> the exact environment that you are running. Thanks. >>>>>>>>>> >>>>>>>>>> Ian >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> -Siddhesh. >>>>>>>>> >>>>>>>>> -- >>>>>>>>> You received this message because you are subscribed to the Google >>>>>>>>> Groups "golang-nuts" group. >>>>>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>>>>> send an email to golang-nuts+unsubscr...@googlegroups.com. >>>>>>>>> To view this discussion on the web visit >>>>>>>>> https://groups.google.com/d/msgid/golang-nuts/CAMjfk%2BggC%2BwgwM_%2BM5ML0SKD3qJphCrif%3D4c2AqB9v6n%2Btw5Jw%40mail.gmail.com >>>>>>>>> <https://groups.google.com/d/msgid/golang-nuts/CAMjfk%2BggC%2BwgwM_%2BM5ML0SKD3qJphCrif%3D4c2AqB9v6n%2Btw5Jw%40mail.gmail.com?utm_medium=email&utm_source=footer> >>>>>>>>> . >>>>>>>>> <backtrace.txt> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> -Siddhesh. >>>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> -Siddhesh. >>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> -Siddhesh. >>>>>> >>>>>> >>>>> >>>>> -- >>>>> -Siddhesh. >>>>> >>>>> -- >>>>> You received this message because you are subscribed to the Google >>>>> Groups "golang-nuts" group. >>>>> To unsubscribe from this group and stop receiving emails from it, send >>>>> an email to golang-nuts+unsubscr...@googlegroups.com. >>>>> To view this discussion on the web visit >>>>> https://groups.google.com/d/msgid/golang-nuts/CAMjfk%2BgQDHshxmGFiUFh6Ket9fp%2BwMuhgH6-FksyzMp9wz%2BXug%40mail.gmail.com >>>>> <https://groups.google.com/d/msgid/golang-nuts/CAMjfk%2BgQDHshxmGFiUFh6Ket9fp%2BwMuhgH6-FksyzMp9wz%2BXug%40mail.gmail.com?utm_medium=email&utm_source=footer> >>>>> . >>>>> >>>>> >>>> >>>> -- >>>> -Siddhesh. >>>> >>> >>> >>> -- >>> -Siddhesh. 
>>> >> >> -- >> -Siddhesh. >> >> > > -- > -Siddhesh. > -- -Siddhesh. -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/CAMjfk%2Bj401-rcUQ6Jdv5nTQknCrG2uFeHY%3DjHcmf2TC0i4BatQ%40mail.gmail.com.
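A minimal sketch of the server-side timeout and profiling setup discussed upthread (names and durations are illustrative, not the actual service code): ReadTimeout, WriteTimeout, and IdleTimeout bound how long a stalled or dead client can hold a connection, and importing net/http/pprof on a separate debug port makes it possible to capture goroutine and CPU profiles the next time the process wedges, without sending a sigabort.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
	"time"
)

func main() {
	// Debug listener for pprof, kept off the main service port.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	s := &http.Server{
		Addr:         ":8080",
		Handler:      mux,
		ReadTimeout:  30 * time.Second, // illustrative values
		WriteTimeout: 60 * time.Second,
		IdleTimeout:  120 * time.Second,
	}
	log.Fatal(s.ListenAndServe())
}

With that in place, http://localhost:6060/debug/pprof/goroutine?debug=2 gives the same full goroutine dump as the sigabort, and go tool pprof http://localhost:6060/debug/pprof/profile captures the CPU profile while the process keeps running.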