The example goroutine in the original post is parked waiting for some other 
goroutine to finish up the GC cycle. Somewhere, a goroutine is getting 
stuck trying to finish it up, which could possibly be a deadlock. (I am 
especially suspicious of a deadlock bug because most threads are stopped 
there.) It would be good to confirm whether application CPU time drops to 
zero or plateaus at some very consistent number, which would support the 
deadlock theory.
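
If an in-process view is easier to wire in than watching Task Manager or 
Process Explorer, here's a rough sketch of a heartbeat using runtime/metrics. 
This is my own illustration, not anything already in your binary; the 
30-second interval is arbitrary, and I believe the /cpu/classes counters are 
only refreshed around GC cycle boundaries, so the OS-level view is still the 
more direct check. (If the runtime itself is wedged, the log simply stopping 
is informative too.)

    package main

    import (
        "log"
        "runtime/metrics"
        "time"
    )

    func main() {
        samples := []metrics.Sample{
            {Name: "/cpu/classes/user:cpu-seconds"},
            {Name: "/cpu/classes/gc/total:cpu-seconds"},
        }
        go func() {
            for range time.Tick(30 * time.Second) {
                metrics.Read(samples)
                // Cumulative CPU seconds; a flat line across samples means
                // the process has effectively stopped doing work.
                log.Printf("user CPU = %.1fs, GC CPU = %.1fs",
                    samples[0].Value.Float64(), samples[1].Value.Float64())
            }
        }()
        select {} // stand-in for the real application's work
    }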

On Wednesday, May 29, 2024 at 11:09:30 AM UTC-4 robert engels wrote:

I am pretty sure the runtime is supposed to crash the process if it slows the 
allocators “too much” (I believe there are some config settings to control 
this).

The runtime never intentionally crashes a process due to rate or volume of 
allocation. There is an optional memory limit, but it's soft in that the 
runtime gives up trying to maintain the limit at some point; see below, and 
https://go.dev/doc/gc-guide.


If you have enough goroutines it may look like they are hung - you need to 
track specific goroutines by their ID. The stack certainly looks like it is 
trying to allocate memory for a slice, and it is being paused because it is 
waiting on the GC lock.

Are you using any native code? You could have heap corruption, which would 
also explain why the GC thread is getting hung.

If the GC thread makes progress, albeit slowly, I would settle on my first 
guess: a memory leak...

I suspect a memory leak is not at play here. With the default configuration, 
a memory leak makes your live heap steadily increase and the time between GC 
cycles grow proportionally; the GC is designed so that its overall impact 
stays roughly constant as the heap grows. In that case a memory leak shows up 
exactly as you'd expect: a steady rise in memory use over time.
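
If you want to confirm or rule that out long before any hang, periodically 
logging the live heap is enough: a leak shows up as HeapAlloc climbing 
steadily over days. A minimal sketch (mine, with an arbitrary one-minute 
interval):

    package main

    import (
        "log"
        "runtime"
        "time"
    )

    func main() {
        go func() {
            var ms runtime.MemStats
            for range time.Tick(time.Minute) {
                // ReadMemStats briefly stops the world, which is fine at
                // this frequency during normal operation.
                runtime.ReadMemStats(&ms)
                log.Printf("live heap = %d MiB, next GC target = %d MiB, GC cycles = %d",
                    ms.HeapAlloc>>20, ms.NextGC>>20, ms.NumGC)
            }
        }()
        select {} // stand-in for the real application's work
    }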

If GOMEMLIMIT is in use (which is non-default; you have to ask for it), 
then a memory leak may result in progressively more frequent GC cycles. 
However, the worst-case failure to make progress (a "GC death spiral") should 
already be prevented in practice by the runtime: the GC CPU limiter puts a 
hard cap on how much CPU time the GC may consume per CPU-second, and lets 
memory use exceed the limit instead. Don't get me wrong, it can still slow 
down the application a good bit, and maybe under some circumstances it'll 
still look like a hang. It is also possible there's a bug in the GC CPU 
limiter or something.
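
For reference, this is what opting in to the soft limit looks like; purely an 
illustration, and the 4 GiB figure is made up rather than a recommendation. 
The same thing can be done without code changes by setting GOMEMLIMIT=4GiB in 
the environment.

    package main

    import "runtime/debug"

    func main() {
        // The GC tries to keep total memory under this soft limit, but the
        // GC CPU limiter caps how much CPU it will spend doing so; past that
        // point the limit is exceeded rather than stalling the program.
        debug.SetMemoryLimit(4 << 30) // 4 GiB
    }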

But, from what I've seen so far in this bug, I'm more convinced of a 
deadlock bug than the application failing to make progress. Lots of 
goroutines lining up on gcMarkDone just sounds too specific to me.


On May 29, 2024, at 10:00 AM, 'Alex Kristiansen' via golang-nuts <
golan...@googlegroups.com> wrote:

That's an interesting idea; I probably wouldn't have thought of that on my 
own. Is that expected behavior for memory pressure on Windows + Go? I don't 
have much Windows experience, so my assumption would be that the Windows 
equivalent of the OOM killer would kick in and just kill the application.

On Tuesday, May 28, 2024 at 4:04:15 PM UTC-7 robert engels wrote:

Feels like a memory leak to me. I would look for a growing heap size in the 
GC logs. I am guessing that the system is not completely hung, but rather 
that the runtime is having a hard time obtaining more memory, so it is 
slowing the allocators to a rate that makes them appear hung.

It may be that the process has consumed nearly all of the OS memory too, so 
the OS is having a hard time responding to malloc requests.

i.e. The system is not making progress.

On May 28, 2024, at 3:54 PM, 'Alex Kristiansen' via golang-nuts <
golan...@googlegroups.com> wrote:

This is an odd one. For reference, this is a customer machine running Windows 
Server 2016, and the binary is compiled with go1.20.11. The application just 
hangs after a number of days; a Windows minidump reveals that most threads 
are doing this:

  Goroutine 462 - User: unicode/utf16/utf16.go:106 unicode/utf16.Decode 
(0xe6e391) [semacquire]
         0  0x0000000000e2d116 in runtime.gopark
             at runtime/proc.go:382
         1  0x0000000000e3df5c in runtime.goparkunlock
             at runtime/proc.go:387
         2  0x0000000000e3df5c in runtime.semacquire1
             at runtime/sema.go:160
         3  0x0000000000e0ac2f in runtime.semacquire
             at runtime/sema.go:111
         4  0x0000000000e0ac2f in runtime.gcMarkDone
             at runtime/mgc.go:787
         5  0x0000000000128c10 in ???
             at ?:-1
         6  0x0000000000dfe7da in runtime.deductAssistCredit
             at runtime/malloc.go:1217
         7  0x0000000000dfdff0 in runtime.mallocgc
             at runtime/malloc.go:932
         8  0x0000000000e3f972 in runtime.makeslice
             at runtime/slice.go:103
         9  0x0000000000e6e391 in unicode/utf16.Decode
             at unicode/utf16/utf16.go:106
        10  0x0000000000e72a7b in syscall.UTF16ToString
             at syscall/syscall_windows.go:63
        11  0x0000000000eb7a67 in os.(*File).readdir
             at os/dir_windows.go:43
        12  0x0000000000eb72c5 in os.(*File).Readdirnames
             at os/dir.go:70
        13  0x0000000000fb623a in path/filepath.glob
             at path/filepath/match.go:346
        14  0x0000000000fb5ea5 in path/filepath.globWithLimit
             at path/filepath/match.go:273
        15  0x00000000031de255 in path/filepath.Glob
             at path/filepath/match.go:243

Multiple threads, all waiting on a semaphore inside mallocgc. The allocation 
that triggers the wait varies, of course (sometimes a string method, or a map 
allocation, or a logging call...). This behavior has remained consistent 
across multiple hangs over a number of weeks. After a bit of digging into the 
minidump, this (I think) is the thread that's holding the semaphore(s), at 
least based on the line numbers:

  Goroutine 37 - User: :0 ??? (0x7ffb0e056974) (thread 5628)
        0  0x00007ffb0e056974 in ???
            at ?:-1
        1  0x0000000000e5c5a0 in runtime.systemstack_switch
            at runtime/asm_amd64.s:463
        2  0x0000000000e0ade5 in runtime.gcMarkDone
            at runtime/mgc.go:855
        3  0x0000000000128c10 in ???
            at ?:-1
        4  0x0000000000e5e881 in runtime.goexit
            at runtime/asm_amd64.s:1598

Note that the precise point in gcMarkDone can change. I have another 
minidump showing a hang at a different point in the same function:

  Goroutine 19 - Go: :0 ??? (0x208351516f8) (thread 7164) [unknown wait reason 30]
        0  0x00007ffe66e66974 in ???
            at ?:-1
        1  0x0000000000e9c5a0 in runtime.systemstack_switch
            at runtime/asm_amd64.s:463
        2  0x0000000000e4acff in runtime.gcMarkDone
            at runtime/mgc.go:807
        3  0x0000000000128c10 in ???
            at ?:-1
        4  0x0000000000e9e881 in runtime.goexit
            at runtime/asm_amd64.s:1598

What's happening here is that a goroutine has switched to the OS thread 
stack and is running there in the runtime. It looks like a Windows minidump 
(unsurprisingly) can't follow the relation. A regular goroutine stack trace 
dump that gets emitted on a forced exit or crash is able to show both sides 
of the system stack switch with a little bit of configuration.

On Linux, I would debug this by setting GOTRACEBACK=crash and then killing 
the hung application with SIGABRT. I am not sure what the equivalent on 
Windows would be, but I can look into it. GOTRACEBACK=crash causes threads 
that the runtime cannot easily preempt to have their stack traces forcibly 
dumped on exit.
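
In the meantime, if rebuilding the binary is an option, one rough workaround 
is to give the process a trigger that dumps every goroutine stack on demand. 
This is only a sketch: the port and endpoint are made up, and collecting the 
stacks needs a stop-the-world, so it may itself block if the runtime is truly 
deadlocked, but it's cheap to try.

    package main

    import (
        "net/http"
        "os"
        "runtime/debug"
        "runtime/pprof"
    )

    func main() {
        // Roughly what GOTRACEBACK=crash asks for: the most verbose
        // traceback, including runtime frames, if the process does crash.
        debug.SetTraceback("crash")

        // Hypothetical debug endpoint: hitting it writes every goroutine
        // stack (debug=2 is the panic-style format) to stderr.
        http.HandleFunc("/debug/dumpstacks", func(w http.ResponseWriter, r *http.Request) {
            pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
        })
        go http.ListenAndServe("localhost:6060", nil)

        select {} // stand-in for the real application's work
    }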

The commonality between these threads appears to be the `stacktrace()` 
method.

Does anyone have any ideas for how to further debug this? Has anyone seen 
anything like this? I never have, and I can't really reproduce it, as the 
only reproduction is "let the application sit and run for a few days." The 
only vaguely similar issue I've found is this one, which mentions 
interference from AV software: https://github.com/golang/go/issues/52178.

Right now I'm waiting for the results of a gctrace log, as well as setting 
GODEBUG=asyncpreemptoff=1, but other than that, I'm a bit out of ideas. 

+1, GODEBUG=asyncpreemptoff=1 is good to rule out, and GODEBUG=gctrace=1 
is always a good go-to for understanding what's going on.

Also, would you be able to share a full stack trace of all goroutines 
(perhaps anonymized)? At that point, it may also be worth filing a bug at 
github.com/golang/go/issues/new.

