The BigQuery client may be “single threaded”, meaning you may need a pool of connections. I’m not sure; I haven’t used the BQ library from Go.
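A rough sketch of that pooling idea: cap how many queries may be in flight at once and give each one its own deadline, so one slow BigQuery call cannot tie up every worker. The runQuery function below is a stand-in for whatever actually talks to BQ (it is not the real client API), and the limit and timeout values are only placeholders.

package main

import (
	"context"
	"log"
	"time"
)

// querySlots caps how many backend queries may run at the same time.
var querySlots = make(chan struct{}, 8)

// runWithLimit acquires a slot, applies a per-query deadline, and runs the query.
func runWithLimit(ctx context.Context, sql string) (string, error) {
	select {
	case querySlots <- struct{}{}: // take a slot
		defer func() { <-querySlots }() // give it back when done
	case <-ctx.Done():
		return "", ctx.Err() // caller gave up while waiting for a slot
	}

	qctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	return runQuery(qctx, sql)
}

// runQuery is a placeholder for the real BigQuery call; it just pretends to work.
func runQuery(ctx context.Context, sql string) (string, error) {
	select {
	case <-time.After(100 * time.Millisecond):
		return "ok: " + sql, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	out, err := runWithLimit(context.Background(), "SELECT 1")
	if err != nil {
		log.Fatal(err)
	}
	log.Println(out)
}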
If that’s the case, a single slow query will block all of your connections (if they are all accessing BQ). Plus there are BQ rate limits - when you hit those the connector may block - again blocking all clients. > On Aug 26, 2020, at 8:27 PM, Siddhesh Divekar <siddhesh.dive...@gmail.com> > wrote: > > > Robert, > > That's where the confusion is. > From the backtrace we see that two go routines are blocked on the waitgroup. > The go routine on which they are blocked is waiting on big query to return. > > On every user request we create a new go routine so they shouldn't get > blocked because of these two. > > However, we do see one go routine blocked on futex. Is there a way to know > where it originated from? > When does golang use futex under the hood so that we can investigate from > that angle? > > goroutine 0 [idle]: > runtime.futex(0x5b607a8, 0x80, 0x0, 0x0, 0x0, 0x7ffc5dd1bd20, 0x437640, > 0xc003638038, 0x7ffc5dd1bd48, 0x40c78f, ...) > /usr/local/go/src/runtime/sys_linux_amd64.s:535 +0x21 > runtime.futexsleep(0x5b607a8, 0x0, 0xffffffffffffffff) > /usr/local/go/src/runtime/os_linux.go:44 +0x46 > runtime.notesleep(0x5b607a8) > /usr/local/go/src/runtime/lock_futex.go:151 +0x9f > runtime.stopm() > /usr/local/go/src/runtime/proc.go:1928 +0xc0 > runtime.exitsyscall0(0xc003638000) > /usr/local/go/src/runtime/proc.go:3140 +0x111 > runtime.mcall(0x0) > /usr/local/go/src/runtime/asm_amd64.s:318 +0x5b > > On Wed, Aug 26, 2020 at 5:02 PM robert engels <reng...@ix.netcom.com> wrote: >> That should allow your server to clean up “dead” clients. Typically you use >> this in conjunction with a ‘keep alive’ in the protocol. >> >> I am doubtful that a bunch of dead clients hanging around would cause a CPU >> spike. You really don’t have too many Go routines/connections involved (I’ve >> worked with 1000’s of live connections). >> >> I would look at something else… I am guessing your synchronization is >> incorrect and your threads are blocking, and you have a few that are >> spinning expecting something to happen that never will (because the source >> of the event is blocked on the mutex/lock). >> >> >> >>> On Aug 26, 2020, at 10:09 AM, Siddhesh Divekar <siddhesh.dive...@gmail.com> >>> wrote: >>> >>> Robert, >>> >>> I assume we can safely add these timeouts based on what we expect >>> should be a reasonable timeout for our clients? >>> >>> s.ReadTimeout = expTimeOut * time.Second >>> s.WriteTimeout = expTimeOut * time.Second >>> >>> On Tue, Aug 25, 2020 at 1:14 PM Siddhesh Divekar >>> <siddhesh.dive...@gmail.com> wrote: >>>> Both servers and data sources are in the cloud. >>>> I would not say a lot of data, it's precomputed data which shouldn't take >>>> that long. >>>> >>>> >>>> On Tue, Aug 25, 2020 at 11:25 AM Robert Engels <reng...@ix.netcom.com> >>>> wrote: >>>>> Are you transferring a lot of data? Are the servers non-cloud hosted? >>>>> >>>>> You could be encountering “tcp stalls”. >>>>> >>>>>> On Aug 25, 2020, at 9:24 AM, Siddhesh Divekar >>>>>> <siddhesh.dive...@gmail.com> wrote: >>>>>> >>>>>> >>>>>> Clients are over the internet. >>>>>> >>>>>> On Tue, Aug 25, 2020 at 3:25 AM Robert Engels <reng...@ix.netcom.com> >>>>>> wrote: >>>>>>> The tcp protocol allows the connection to wait for hours. Go routines >>>>>>> stuck in wait do not burn CPU. Are the clients local or remote (over >>>>>>> internet)?
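For reference, a fuller version of the s.ReadTimeout / s.WriteTimeout snippet quoted above. These are the standard net/http server knobs for dropping dead or stalled clients; the durations are placeholders to tune for your own clients, not values recommended anywhere in this thread.

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	s := &http.Server{
		Addr:              ":8080",
		Handler:           mux,
		ReadHeaderTimeout: 10 * time.Second, // time allowed to read the request headers
		ReadTimeout:       30 * time.Second, // time allowed to read the whole request
		WriteTimeout:      60 * time.Second, // time allowed to write the response
		IdleTimeout:       90 * time.Second, // how long a keep-alive connection may sit idle
	}
	log.Fatal(s.ListenAndServe())
}

IdleTimeout is the one that actually reaps idle keep-alive connections; ReadTimeout and WriteTimeout bound a single request and response.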
>>>>>>> >>>>>>>> On Aug 24, 2020, at 10:29 PM, Siddhesh Divekar >>>>>>>> <siddhesh.dive...@gmail.com> wrote: >>>>>>>> >>>>>>>> >>>>>>>> Robert, >>>>>>>> >>>>>>>> We will do the profiling next time we hit the issue again & see what >>>>>>>> is happening. >>>>>>>> This was the first time we saw the issue & don't want to get rid of >>>>>>>> http2 advantages without making sure it's the actual culprit. >>>>>>>> >>>>>>>> Do you think in the meanwhile we should do what the discussion below >>>>>>>> suggests anyway? >>>>>>>> https://stackoverflow.com/questions/42238695/goroutine-in-io-wait-state-for-long-time >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Aug 24, 2020 at 5:37 PM Robert Engels <reng...@ix.netcom.com> >>>>>>>> wrote: >>>>>>>>> I think it is too hard to tell with the limited information. It could >>>>>>>>> be exhausted connections or it could be thrashing (the claim of high >>>>>>>>> cpu). >>>>>>>>> >>>>>>>>> I think you want to run profiling capture prior to hitting the stuck >>>>>>>>> state - you should be able to detect what is happening. >>>>>>>>> >>>>>>>>> If the referenced issue is related I would assume you should be able >>>>>>>>> to connect by forcing http/1. >>>>>>>>> >>>>>>>>> You can also try disabling http/2 and see if your issue goes away. >>>>>>>>> >>>>>>>>> On Aug 24, 2020, at 6:15 PM, Siddhesh Divekar >>>>>>>>> <siddhesh.dive...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Robert, >>>>>>>>>> >>>>>>>>>> Sorry I missed your earlier response. >>>>>>>>>> >>>>>>>>>> From what we saw our UI was blocked and since everything was >>>>>>>>>> unresponsive >>>>>>>>>> we had to recover the system by sending sig abort. >>>>>>>>>> >>>>>>>>>> On Mon, Aug 24, 2020 at 4:11 PM Siddhesh Divekar >>>>>>>>>> <siddhesh.dive...@gmail.com> wrote: >>>>>>>>>>> Looking at the no. of go routines we have, does this apply to this >>>>>>>>>>> issue? >>>>>>>>>>> https://github.com/golang/go/issues/27044 >>>>>>>>>>> >>>>>>>>>>> On Mon, Aug 24, 2020 at 12:54 PM Robert Engels >>>>>>>>>>> <reng...@ix.netcom.com> wrote: >>>>>>>>>>>> Go routines in a waiting state should not be consuming CPU. Are >>>>>>>>>>>> you certain they are not in constant transition from waiting to >>>>>>>>>>>> processing - this could show up as high CPU usage while everything >>>>>>>>>>>> looks blocked. >>>>>>>>>>>> >>>>>>>>>>>> I would use pprof - github.com/robaho/goanalyzer might be of >>>>>>>>>>>> assistance here to see the actual work being done. >>>>>>>>>>>> >>>>>>>>>>>>> On Aug 24, 2020, at 9:10 AM, Siddhesh Divekar >>>>>>>>>>>>> <siddhesh.dive...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Ian, >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks for replying. >>>>>>>>>>>>> >>>>>>>>>>>>> We have a go server running which handles user requests & >>>>>>>>>>>>> collects data from various sources like gcp's cloud sql and big >>>>>>>>>>>>> query. >>>>>>>>>>>>> We are using shopify's sarama library to do kafka operations. >>>>>>>>>>>>> >>>>>>>>>>>>> We are seeing lots of go routines in waiting state for several >>>>>>>>>>>>> minutes. >>>>>>>>>>>>> Over the period of time around 587 goroutines have been spun up. >>>>>>>>>>>>> >>>>>>>>>>>>> We see that two go routines are stuck on gcp big query and we are >>>>>>>>>>>>> using wait groups there. >>>>>>>>>>>>> However, it's not clear why that would cause all other go >>>>>>>>>>>>> routines to get hung & make cpu go high.
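On the wait-group question at the end of the message above: Wait parks only the goroutine that calls it, and it returns only once every worker has called Done, so if the underlying BigQuery call never returns, the waiter sits in semacquire indefinitely. Below is an illustrative sketch of that fan-out shape with a per-call deadline; it is not the actual runParallelQuery code, and fetchOne is a made-up stand-in for the real query call.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// fetchOne stands in for a single BigQuery/Cloud SQL call.
func fetchOne(ctx context.Context, q string) (string, error) {
	select {
	case <-time.After(200 * time.Millisecond): // pretend work
		return "result of " + q, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// runParallel fans the queries out to worker goroutines and waits for all of them.
func runParallel(ctx context.Context, queries []string) []string {
	var wg sync.WaitGroup
	results := make([]string, len(queries))

	for i, q := range queries {
		wg.Add(1)
		go func(i int, q string) {
			defer wg.Done() // must run on every path, or Wait blocks forever

			qctx, cancel := context.WithTimeout(ctx, 5*time.Second)
			defer cancel()

			r, err := fetchOne(qctx, q)
			if err != nil {
				r = "error: " + err.Error()
			}
			results[i] = r
		}(i, q)
	}

	wg.Wait() // parks only this goroutine, until every worker has called Done
	return results
}

func main() {
	fmt.Println(runParallel(context.Background(), []string{"q1", "q2", "q3"}))
}

With a deadline on each call, a stuck backend shows up as an error after a few seconds instead of a goroutine parked for 79 minutes.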
>>>>>>>>>>>>> >>>>>>>>>>>>> goroutine 3332131 [semacquire, 79 minutes]: >>>>>>>>>>>>> sync.runtime_Semacquire(0xc001c4fcf8) >>>>>>>>>>>>> /usr/local/go/src/runtime/sema.go:56 +0x42 >>>>>>>>>>>>> sync.(*WaitGroup).Wait(0xc001c4fcf0) >>>>>>>>>>>>> /usr/local/go/src/sync/waitgroup.go:130 +0x64 >>>>>>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).runParallelQuery(0xc001b54d40, >>>>>>>>>>>>> 0xc002912c00, 0x330e1b0, 0xf, 0xc002912cf0, 0x3) >>>>>>>>>>>>> /builds/fusionio/fusion/controller/stats/ins_mgr.go:488 >>>>>>>>>>>>> +0x1d7 >>>>>>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).GetMainUi(0xc001b54d40, >>>>>>>>>>>>> 0xc002912db8, 0xc001870e68, 0x746121, 0xc0010fcaf8, 0x17) >>>>>>>>>>>>> /builds/fusionio/fusion/controller/stats/ins_mgr.go:567 >>>>>>>>>>>>> +0xa0d >>>>>>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).fetchMainUiTeamInterval(0xc001b56780, >>>>>>>>>>>>> 0xc002356810, 0x24, 0x32f7b78, 0x5) >>>>>>>>>>>>> /builds/fusionio/fusion/controller/stats/prefetcher.go:77 >>>>>>>>>>>>> +0xf2 >>>>>>>>>>>>> created by >>>>>>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).prefetchStats >>>>>>>>>>>>> >>>>>>>>>>>>> /builds/fusionio/fusion/controller/stats/prefetcher.go:100 +0xd8 >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> goroutine 3332149 [semacquire, 79 minutes]: >>>>>>>>>>>>> sync.runtime_Semacquire(0xc0015ede48) >>>>>>>>>>>>> /usr/local/go/src/runtime/sema.go:56 +0x42 >>>>>>>>>>>>> sync.(*WaitGroup).Wait(0xc0015ede40) >>>>>>>>>>>>> /usr/local/go/src/sync/waitgroup.go:130 +0x64 >>>>>>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).runParallelQuery(0xc001b54d40, >>>>>>>>>>>>> 0xc00249dc00, 0x330e1b0, 0xf, 0xc00249dcf0, 0x3) >>>>>>>>>>>>> /builds/fusionio/fusion/controller/stats/ins_mgr.go:488 >>>>>>>>>>>>> +0x1d7 >>>>>>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).GetMainUi(0xc001b54d40, >>>>>>>>>>>>> 0xc00249ddb8, 0xc003200668, 0xc00407a520, 0xc003200590, 0x46ee97) >>>>>>>>>>>>> /builds/fusionio/fusion/controller/stats/ins_mgr.go:567 >>>>>>>>>>>>> +0xa0d >>>>>>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).fetchMainUiTeamInterval(0xc001b56780, >>>>>>>>>>>>> 0xc002356ba0, 0x24, 0x32f7b78, 0x5) >>>>>>>>>>>>> /builds/fusionio/fusion/controller/stats/prefetcher.go:77 >>>>>>>>>>>>> +0xf2 >>>>>>>>>>>>> created by >>>>>>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).prefetchStats >>>>>>>>>>>>> >>>>>>>>>>>>> /builds/fusionio/fusion/controller/stats/prefetcher.go:100 +0xd8 >>>>>>>>>>>>> >>>>>>>>>>>>> I found the link below which kind of co-relates to our scenario. >>>>>>>>>>>>> https://stackoverflow.com/questions/42238695/goroutine-in-io-wait-state-for-long-time >>>>>>>>>>>>> >>>>>>>>>>>>> Most of the go routines in the backtrace are in a net/http >>>>>>>>>>>>> package so our suspicion is that above bug in our code might be >>>>>>>>>>>>> causing that. >>>>>>>>>>>>> Even the bigquery is getting hung in net/http. >>>>>>>>>>>>> >>>>>>>>>>>>> We are using go version 1.13.8 & are running on gcp kubernetes >>>>>>>>>>>>> cluster on ubuntu 18.04 docker. 
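One way to act on the earlier suggestion to capture a profile before the process wedges: expose net/http/pprof on a side listener (sketch below; the localhost:6060 address is arbitrary) so a goroutine dump or CPU profile can be pulled the next time this happens, instead of recovering the system with SIGABRT.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Separate listener so profiling keeps working even if the main server is stuck.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the real HTTP server / workers here ...
	select {} // placeholder to keep this example alive
}

Then http://localhost:6060/debug/pprof/goroutine?debug=2 returns a full stack dump like the ones above, and /debug/pprof/profile?seconds=30 captures a CPU profile for the high-CPU question.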
>>>>>>>>>>>>> >>>>>>>>>>>>> go env >>>>>>>>>>>>> GO111MODULE="" >>>>>>>>>>>>> GOARCH="amd64" >>>>>>>>>>>>> GOBIN="" >>>>>>>>>>>>> GOCACHE="/root/.cache/go-build" >>>>>>>>>>>>> GOENV="/root/.config/go/env" >>>>>>>>>>>>> GOEXE="" >>>>>>>>>>>>> GOFLAGS="" >>>>>>>>>>>>> GOHOSTARCH="amd64" >>>>>>>>>>>>> GOHOSTOS="linux" >>>>>>>>>>>>> GONOPROXY="" >>>>>>>>>>>>> GONOSUMDB="" >>>>>>>>>>>>> GOOS="linux" >>>>>>>>>>>>> GOPATH="/root/go" >>>>>>>>>>>>> GOPRIVATE="" >>>>>>>>>>>>> GOPROXY="https://proxy.golang.org,direct" >>>>>>>>>>>>> GOROOT="/usr/local/go" >>>>>>>>>>>>> GOSUMDB="sum.golang.org" >>>>>>>>>>>>> GOTMPDIR="" >>>>>>>>>>>>> GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64" >>>>>>>>>>>>> GCCGO="gccgo" >>>>>>>>>>>>> AR="ar" >>>>>>>>>>>>> CC="gcc" >>>>>>>>>>>>> CXX="g++" >>>>>>>>>>>>> CGO_ENABLED="1" >>>>>>>>>>>>> GOMOD="/builds/prosimoio/prosimo/pdash/go.mod" >>>>>>>>>>>>> CGO_CFLAGS="-g -O2" >>>>>>>>>>>>> CGO_CPPFLAGS="" >>>>>>>>>>>>> CGO_CXXFLAGS="-g -O2" >>>>>>>>>>>>> CGO_FFLAGS="-g -O2" >>>>>>>>>>>>> CGO_LDFLAGS="-g -O2" >>>>>>>>>>>>> PKG_CONFIG="pkg-config" >>>>>>>>>>>>> GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 >>>>>>>>>>>>> -fdebug-prefix-map=/tmp/go-build048009048=/tmp/go-build >>>>>>>>>>>>> -gno-record-gcc-switches" >>>>>>>>>>>>> >>>>>>>>>>>>> Let me know if any other information is needed. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Sat, Aug 22, 2020 at 12:30 PM Ian Lance Taylor >>>>>>>>>>>>> <i...@golang.org> wrote: >>>>>>>>>>>>>> On Sat, Aug 22, 2020 at 12:06 PM Siddhesh Divekar >>>>>>>>>>>>>> <siddhesh.dive...@gmail.com> wrote: >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > We saw an issue with our process running in k8s on ubuntu >>>>>>>>>>>>>> > 18.04. >>>>>>>>>>>>>> > Most of the go routines are stuck for several minutes in >>>>>>>>>>>>>> > http/http2 net code. >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > Have you seen similar issues ? >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > goroutine 2800143 [select, 324 minutes]: >>>>>>>>>>>>>> > net/http.(*persistConn).readLoop(0xc00187d440) >>>>>>>>>>>>>> > /usr/local/go/src/net/http/transport.go:2032 +0x999 created by >>>>>>>>>>>>>> > net/http.(*Transport).dialConn >>>>>>>>>>>>>> > /usr/local/go/src/net/http/transport.go:1580 +0xb0d goroutine >>>>>>>>>>>>>> > 2738894 [IO wait, 352 minutes]: >>>>>>>>>>>>>> > internal/poll.runtime_pollWait(0x7f5b61b280c0, 0x72, >>>>>>>>>>>>>> > 0xffffffffffffffff) /usr/local/go/src/runtime/netpoll.go:184 >>>>>>>>>>>>>> > +0x55 internal/poll.(*pollDesc).wait(0xc0017e7e18, 0x72, >>>>>>>>>>>>>> > 0x1000, 0x1000, 0xffffffffffffffff) >>>>>>>>>>>>>> > /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45 >>>>>>>>>>>>>> > internal/poll.(*pollDesc).waitRead(...) 
>>>>>>>>>>>>>> > /usr/local/go/src/internal/poll/fd_poll_runtime.go:92 >>>>>>>>>>>>>> > internal/poll.(*FD).Read(0xc0017e7e00, 0xc0044a9000, 0x1000, >>>>>>>>>>>>>> > 0x1000, 0x0, 0x0, 0x0) >>>>>>>>>>>>>> > /usr/local/go/src/internal/poll/fd_unix.go:169 +0x1cf >>>>>>>>>>>>>> > net.(*netFD).Read(0xc0017e7e00, 0xc0044a9000, 0x1000, 0x1000, >>>>>>>>>>>>>> > 0xc0026359e8, 0x49d7fd, 0xc0017e7e00) >>>>>>>>>>>>>> > /usr/local/go/src/net/fd_unix.go:202 +0x4f >>>>>>>>>>>>>> > net.(*conn).Read(0xc0000db8b8, 0xc0044a9000, 0x1000, 0x1000, >>>>>>>>>>>>>> > 0x0, 0x0, 0x0) /usr/local/go/src/net/net.go:184 +0x68 >>>>>>>>>>>>>> > net/http.(*connReader).Read(0xc004a4fec0, 0xc0044a9000, >>>>>>>>>>>>>> > 0x1000, 0x1000, 0x0, 0x0, 0x0) >>>>>>>>>>>>>> > /usr/local/go/src/net/http/server.go:785 +0xf4 >>>>>>>>>>>>>> > bufio.(*Reader).fill(0xc003f1a360) >>>>>>>>>>>>>> > /usr/local/go/src/bufio/bufio.go:100 +0x103 >>>>>>>>>>>>>> > bufio.(*Reader).Peek(0xc003f1a360, 0x4, 0x0, 0x0, 0x0, 0x0, >>>>>>>>>>>>>> > 0xc002635ad0) /usr/local/go/src/bufio/bufio.go:138 +0x4f >>>>>>>>>>>>>> > net/http.(*conn).readRequest(0xc0028e1d60, 0x393ed20, >>>>>>>>>>>>>> > 0xc0024e9780, 0x0, 0x0, 0x0) >>>>>>>>>>>>>> > /usr/local/go/src/net/http/server.go:962 +0xb3b >>>>>>>>>>>>>> > net/http.(*conn).serve(0xc0028e1d60, 0x393ed20, 0xc0024e9780) >>>>>>>>>>>>>> > /usr/local/go/src/net/http/server.go:1817 +0x6d4 created by >>>>>>>>>>>>>> > net/http.(*Server).Serve >>>>>>>>>>>>>> > /usr/local/go/src/net/http/server.go:2928 +0x384 >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > Is there a know issue or something obvious from the backtrace >>>>>>>>>>>>>> > here. >>>>>>>>>>>>>> >>>>>>>>>>>>>> It's entirely normal for goroutines to sit in pollWait if they >>>>>>>>>>>>>> are >>>>>>>>>>>>>> waiting for network I/O. There may be reasons why this is >>>>>>>>>>>>>> incorrect >>>>>>>>>>>>>> for your program, but you'll have to tell us those reasons. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Also, along with those reasons, please tell us the version of Go >>>>>>>>>>>>>> and >>>>>>>>>>>>>> the exact environment that you are running. Thanks. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Ian >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> -Siddhesh. >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> You received this message because you are subscribed to the >>>>>>>>>>>>> Google Groups "golang-nuts" group. >>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>>>>>>>>> send an email to golang-nuts+unsubscr...@googlegroups.com. >>>>>>>>>>>>> To view this discussion on the web visit >>>>>>>>>>>>> https://groups.google.com/d/msgid/golang-nuts/CAMjfk%2BggC%2BwgwM_%2BM5ML0SKD3qJphCrif%3D4c2AqB9v6n%2Btw5Jw%40mail.gmail.com. >>>>>>>>>>>>> <backtrace.txt> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> -Siddhesh. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> -Siddhesh. >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> -Siddhesh. >>>>>> >>>>>> >>>>>> -- >>>>>> -Siddhesh. >>>>>> >>>>>> -- >>>>>> You received this message because you are subscribed to the Google >>>>>> Groups "golang-nuts" group. >>>>>> To unsubscribe from this group and stop receiving emails from it, send >>>>>> an email to golang-nuts+unsubscr...@googlegroups.com. >>>>>> To view this discussion on the web visit >>>>>> https://groups.google.com/d/msgid/golang-nuts/CAMjfk%2BgQDHshxmGFiUFh6Ket9fp%2BwMuhgH6-FksyzMp9wz%2BXug%40mail.gmail.com. >>>> >>>> >>>> -- >>>> -Siddhesh. >>> >>> >>> -- >>> -Siddhesh. >>> >>> -- >>> You received this message because you are subscribed to the Google Groups >>> "golang-nuts" group. 
>>> To unsubscribe from this group and stop receiving emails from it, send an >>> email to golang-nuts+unsubscr...@googlegroups.com. >>> To view this discussion on the web visit >>> https://groups.google.com/d/msgid/golang-nuts/CAMjfk%2BgHmUTo-f4u2NOtvzuZ-VH7708%3DXBcM02c76KDCCPiY%3Dw%40mail.gmail.com. >> > > > -- > -Siddhesh. -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/5319910A-98C1-4B9B-AE6C-4882F940709C%40ix.netcom.com.