The example goroutine in the original post is parked waiting for some other 
goroutine to finish up the GC cycle. Somewhere, a goroutine is getting 
stuck trying to finish it up, which could possibly be a deadlock. (I am 
especially suspicious of a deadlock bug because most threads are stopped 
there.) It would be good to confirm whether application CPU time drops to 
zero or plateaus at some very consistent number, which would support the 
deadlock theory.
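
If an in-process view is easier to wire in than watching Task Manager or 
Process Explorer, here's a rough sketch of a heartbeat using runtime/metrics. 
This is my own illustration, not anything already in your binary; the 
30-second interval is arbitrary, and I believe the /cpu/classes counters are 
only refreshed around GC cycle boundaries, so the OS-level view is still the 
more direct check. (If the runtime itself is wedged, the log simply stopping 
is informative too.)

    package main

    import (
        "log"
        "runtime/metrics"
        "time"
    )

    func main() {
        samples := []metrics.Sample{
            {Name: "/cpu/classes/user:cpu-seconds"},
            {Name: "/cpu/classes/gc/total:cpu-seconds"},
        }
        go func() {
            for range time.Tick(30 * time.Second) {
                metrics.Read(samples)
                // Cumulative CPU seconds; a flat line across samples means
                // the process has effectively stopped doing work.
                log.Printf("user CPU = %.1fs, GC CPU = %.1fs",
                    samples[0].Value.Float64(), samples[1].Value.Float64())
            }
        }()
        select {} // stand-in for the real application's work
    }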

On Wednesday, May 29, 2024 at 11:09:30 AM UTC-4 robert engels wrote:

I am pretty sure the runtime is supposed to crash the process if it slows the 
allocators “too much” (I believe there are some config settings to control 
this).

The runtime never intentionally crashes a process due to rate or volume of 
allocation. There is an optional memory limit, but it's soft in that the 
runtime gives up trying to maintain the limit at some point; see below, and 
https://go.dev/doc/gc-guide.


If you have enough goroutines it may look like they are hung - you need to 
track specific goroutines by their ID. The stack certainly looks like it is 
trying to allocate memory for a slice, and it is being paused because it is 
waiting on the GC lock.

Are you using any native code? You could have heap corruption, which would 
also explain why the GC thread is getting hung.

If the GC thread makes progress, albeit slowly, I would settle on my first 
guess: a memory leak...

I suspect a memory leak is not at play here. With the default configuration, 
a memory leak makes your live heap steadily increase and the time between GC 
cycles grow proportionally; the GC is designed so that its overall impact 
stays roughly constant as the heap grows. In that case a memory leak shows up 
exactly as you'd expect: a steady rise in memory use over time.
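
If you want to confirm or rule that out long before any hang, periodically 
logging the live heap is enough: a leak shows up as HeapAlloc climbing 
steadily over days. A minimal sketch (mine, with an arbitrary one-minute 
interval):

    package main

    import (
        "log"
        "runtime"
        "time"
    )

    func main() {
        go func() {
            var ms runtime.MemStats
            for range time.Tick(time.Minute) {
                // ReadMemStats briefly stops the world, which is fine at
                // this frequency during normal operation.
                runtime.ReadMemStats(&ms)
                log.Printf("live heap = %d MiB, next GC target = %d MiB, GC cycles = %d",
                    ms.HeapAlloc>>20, ms.NextGC>>20, ms.NumGC)
            }
        }()
        select {} // stand-in for the real application's work
    }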

If GOMEMLIMIT is in use (which is non-default; you have to ask for it), 
then a memory leak may result in progressively more frequent GC cycles. 
However, the worst-case failure to make progress (a "GC death spiral") should 
already be prevented in practice by the runtime: the GC CPU limiter puts a 
hard cap on how much CPU time the GC may consume per CPU-second, and lets 
memory use exceed the limit instead. Don't get me wrong, it can still slow 
down the application a good bit, and maybe under some circumstances it'll 
still look like a hang. It is also possible there's a bug in the GC CPU 
limiter or something.
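
For reference, this is what opting in to the soft limit looks like; purely an 
illustration, and the 4 GiB figure is made up rather than a recommendation. 
The same thing can be done without code changes by setting GOMEMLIMIT=4GiB in 
the environment.

    package main

    import "runtime/debug"

    func main() {
        // The GC tries to keep total memory under this soft limit, but the
        // GC CPU limiter caps how much CPU it will spend doing so; past that
        // point the limit is exceeded rather than stalling the program.
        debug.SetMemoryLimit(4 << 30) // 4 GiB
    }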

But, from what I've seen so far in this bug, I'm more convinced of a 
deadlock bug than the application failing to make progress. Lots of 
goroutines lining up on gcMarkDone just sounds too specific to me.


On May 29, 2024, at 10:00 AM, 'Alex Kristiansen' via golang-nuts <
golan...@googlegroups.com> wrote:

That's an interesting idea; I probably wouldn't have thought of that on my 
own. Is that expected behavior for memory pressure on Windows + Go? I don't 
have much Windows experience, so my assumption would be that the Windows 
equivalent of the OOM killer would kick in and just kill the application.

On Tuesday, May 28, 2024 at 4:04:15 PM UTC-7 robert engels wrote:

Feels like a memory leak to me. I would look for a growing heap size in the 
GC logs. I am guessing that the system is not completely hung, but rather 
that the runtime is having a hard time obtaining more memory, so it is 
slowing the allocators to a rate that makes them appear hung.

It may be that the process has consumed nearly all of the OS memory too, so 
the OS is having a hard time responding to malloc requests.

i.e. The system is not making progress.

On May 28, 2024, at 3:54 PM, 'Alex Kristiansen' via golang-nuts <
golan...@googlegroups.com> wrote:

This is an odd one. For reference, this is a customer machine running Windows 
Server 2016, and the binary is compiled with go1.20.11. The application just 
hangs after a number of days; a Windows minidump reveals that most threads 
are doing this:

  Goroutine 462 - User: unicode/utf16/utf16.go:106 unicode/utf16.Decode 
(0xe6e391) [semacquire]
         0  0x0000000000e2d116 in runtime.gopark
             at runtime/proc.go:382
         1  0x0000000000e3df5c in runtime.goparkunlock
             at runtime/proc.go:387
         2  0x0000000000e3df5c in runtime.semacquire1
             at runtime/sema.go:160
         3  0x0000000000e0ac2f in runtime.semacquire
             at runtime/sema.go:111
         4  0x0000000000e0ac2f in runtime.gcMarkDone
             at runtime/mgc.go:787
         5  0x0000000000128c10 in ???
             at ?:-1
         6  0x0000000000dfe7da in runtime.deductAssistCredit
             at runtime/malloc.go:1217
         7  0x0000000000dfdff0 in runtime.mallocgc
             at runtime/malloc.go:932
         8  0x0000000000e3f972 in runtime.makeslice
             at runtime/slice.go:103
         9  0x0000000000e6e391 in unicode/utf16.Decode
             at unicode/utf16/utf16.go:106
        10  0x0000000000e72a7b in syscall.UTF16ToString
             at syscall/syscall_windows.go:63
        11  0x0000000000eb7a67 in os.(*File).readdir
             at os/dir_windows.go:43
        12  0x0000000000eb72c5 in os.(*File).Readdirnames
             at os/dir.go:70
        13  0x0000000000fb623a in path/filepath.glob
             at path/filepath/match.go:346
        14  0x0000000000fb5ea5 in path/filepath.globWithLimit
             at path/filepath/match.go:273
        15  0x00000000031de255 in path/filepath.Glob
             at path/filepath/match.go:243

Multiple threads, all waiting on a semaphore inside mallocgc. The allocation 
that triggers the wait varies, of course (sometimes a string method, or a map 
allocation, or a logging call...). This behavior has remained consistent 
across multiple hangs over a number of weeks. After a bit of digging into the 
minidump, this (I think) is the thread that's holding the semaphore(s), at 
least based on the line numbers:

  Goroutine 37 - User: :0 ??? (0x7ffb0e056974) (thread 5628)
        0  0x00007ffb0e056974 in ???
            at ?:-1
        1  0x0000000000e5c5a0 in runtime.systemstack_switch
            at runtime/asm_amd64.s:463
        2  0x0000000000e0ade5 in runtime.gcMarkDone
            at runtime/mgc.go:855
        3  0x0000000000128c10 in ???
            at ?:-1
        4  0x0000000000e5e881 in runtime.goexit
            at runtime/asm_amd64.s:1598

Note that the precise point in gcMarkDone can change. I have another 
minidump showing a hang at a different point in the same function:

  Goroutine 19 - Go: :0 ??? (0x208351516f8) (thread 7164) [unknown wait reason 30]
        0  0x00007ffe66e66974 in ???
            at ?:-1
        1  0x0000000000e9c5a0 in runtime.systemstack_switch
            at runtime/asm_amd64.s:463
        2  0x0000000000e4acff in runtime.gcMarkDone
            at runtime/mgc.go:807
        3  0x0000000000128c10 in ???
            at ?:-1
        4  0x0000000000e9e881 in runtime.goexit
            at runtime/asm_amd64.s:1598

What's happening here is that a goroutine has switched to the OS thread 
stack and is running there in the runtime. It looks like a Windows minidump 
(unsurprisingly) can't follow the relation. A regular goroutine stack trace 
dump that gets emitted on a forced exit or crash is able to show both sides 
of the system stack switch with a little bit of configuration.

On Linux, I would debug this by setting GOTRACEBACK=crash and then killing 
the hung application with SIGABRT. I am not sure what the equivalent on 
Windows would be, but I can look into it. GOTRACEBACK=crash causes threads 
that the runtime cannot easily preempt to have their stack traces forcibly 
dumped on exit.
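
In the meantime, if rebuilding the binary is an option, one rough workaround 
is to give the process a trigger that dumps every goroutine stack on demand. 
This is only a sketch: the port and endpoint are made up, and collecting the 
stacks needs a stop-the-world, so it may itself block if the runtime is truly 
deadlocked, but it's cheap to try.

    package main

    import (
        "net/http"
        "os"
        "runtime/debug"
        "runtime/pprof"
    )

    func main() {
        // Roughly what GOTRACEBACK=crash asks for: the most verbose
        // traceback, including runtime frames, if the process does crash.
        debug.SetTraceback("crash")

        // Hypothetical debug endpoint: hitting it writes every goroutine
        // stack (debug=2 is the panic-style format) to stderr.
        http.HandleFunc("/debug/dumpstacks", func(w http.ResponseWriter, r *http.Request) {
            pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
        })
        go http.ListenAndServe("localhost:6060", nil)

        select {} // stand-in for the real application's work
    }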

The commonality between these threads appears to be the `stacktrace()` 
method.

Does anyone have any ideas for how to further debug this? Has anyone seen 
anything like this? I never have, and I can't really reproduce it, as the 
only reproduction is "let the application sit and run for a few days." The 
only vaguely similar issue I've found is this one, which mentions 
interference from AV software: https://github.com/golang/go/issues/52178.

Right now I'm waiting for the results of a gctrace log, as well as setting 
GODEBUG=asyncpreemptoff=1, but other than that, I'm a bit out of ideas. 

+1, GODEBUG=asyncpreemptoff=1 is good to rule out, and GODEBUG=gctrace=1 
is always a good go-to for understanding what's going on.

Also, would you be able to share a full stack trace of all goroutines 
(perhaps anonymized)? At that point, it may also be worth filing a bug at 
github.com/golang/go/issues/new.

