The example goroutine in the original post is parked waiting for some other goroutine to finish up the GC cycle. Somewhere, a goroutine is getting stuck doing that, which could well be a deadlock. (I am especially suspicious of a deadlock bug because most threads are stopped there.) It would be good to confirm whether application CPU time drops to zero or plateaus at some very consistent number, either of which would support the deadlock theory.
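For reference, here is a rough sketch of one way to watch that from inside the process, assuming Go 1.20+ where the /cpu/classes/... runtime metrics exist; the 30-second interval and the log destination are placeholders. If total application CPU flatlines while the process looks hung, that points at a deadlock rather than a slow GC. Note that if the runtime really is wedged, this goroutine can stall too, so an external check (Task Manager / Process Explorer) remains the ground truth.

```go
package main

import (
	"log"
	"runtime/metrics"
	"strings"
	"time"
)

// logCPUClasses periodically logs the cumulative CPU-time metrics the runtime
// tracks for itself (GC, user, idle, ...). The names are discovered at runtime
// so nothing here hard-codes a metric that might not exist on this Go version.
func logCPUClasses() {
	var samples []metrics.Sample
	for _, d := range metrics.All() {
		if strings.HasPrefix(d.Name, "/cpu/classes/") {
			samples = append(samples, metrics.Sample{Name: d.Name})
		}
	}
	for range time.Tick(30 * time.Second) { // interval is arbitrary
		metrics.Read(samples)
		for _, s := range samples {
			if s.Value.Kind() == metrics.KindFloat64 {
				log.Printf("%s = %.2fs", s.Name, s.Value.Float64())
			}
		}
	}
}

func main() {
	go logCPUClasses()
	select {} // stand-in for the real application
}
```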
On Wednesday, May 29, 2024 at 11:09:30 AM UTC-4 robert engels wrote:

> I am pretty sure runtime is supposed to crash the process if it slows the allocators “too much” (I believe there are some config settings to control this).

The runtime never intentionally crashes a process because of the rate or volume of allocation. There is an optional memory limit, but it is soft, in the sense that the runtime gives up trying to maintain the limit at some point; see below, and see https://go.dev/doc/gc-guide.

> If you have enough Go routines it may look like they are hung - you need to track specific routines by their ID. The stack certainly looks like it is trying to allocate memory for a slice - and it is being paused - because it is waiting on the GC lock. Are you doing any native code? You could have heap corruption and that is also why the GC thread is getting hung. If the GC thread makes progress albeit slowly - I would settle on my first guess - memory leak...

I suspect a memory leak is not at play here. With the default configuration, a leak makes the live heap grow steadily, and the time between GC cycles grows proportionally; the GC is designed so that its overall impact stays roughly constant as the heap grows. In that case a memory leak looks exactly as you would expect: memory use rising over time.

If GOMEMLIMIT is in use (which is non-default; you have to ask for it), then a memory leak may result in progressively more frequent GC cycles. However, the worst-case failure to make progress (the "GC death spiral") should already be prevented in practice: the runtime puts a hard limit on how much CPU time the GC may consume out of every CPU-second, and lets memory use rise past the limit instead. Don't get me wrong, that can still slow the application down a good bit, and under some circumstances it might still look like a hang. It is also possible there is a bug in the GC CPU limiter. But from what I've seen so far in this bug, I'm more convinced of a deadlock than of the application simply failing to make progress; lots of goroutines lining up on gcMarkDone just sounds too specific to me.
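For context (not something the original poster reports doing), opting in to that soft limit is done either with the GOMEMLIMIT environment variable or with runtime/debug.SetMemoryLimit; a minimal sketch, with the 4 GiB figure being purely illustrative:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to starting the process with GOMEMLIMIT=4GiB in the
	// environment. Neither is in effect unless you ask for it, and the
	// limit is soft: the GC CPU limiter lets memory exceed it rather than
	// spiral into back-to-back GC cycles. See https://go.dev/doc/gc-guide.
	prev := debug.SetMemoryLimit(4 << 30) // limit in bytes; returns the previous limit
	fmt.Println("previous limit (bytes):", prev)

	// A negative value queries the current limit without changing it.
	fmt.Println("current limit (bytes):", debug.SetMemoryLimit(-1))
}
```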
> On May 29, 2024, at 10:00 AM, 'Alex Kristiansen' via golang-nuts <golan...@googlegroups.com> wrote:
>
> That's an interesting idea, I probably wouldn't have thought of that on my own. Is that expected behavior for memory pressure on Windows+golang? I don't have much Windows experience, so my assumption would be that the Windows equivalent of the OOM killer would kick in and just kill the application.
>
> On Tuesday, May 28, 2024 at 4:04:15 PM UTC-7 robert engels wrote:
>
> Feels like a memory leak to me. I would look for growing heap size in the gc logs. I am guessing that the system is not completely hung - but rather the runtime is having a hard time obtaining more memory, so it is slowing the allocators to a rate that makes them appear hung. It may be that the process has consumed nearly all of the OS memory too - so the OS is having a hard time responding to malloc requests. i.e. the system is not making progress.
>
> On May 28, 2024, at 3:54 PM, 'Alex Kristiansen' via golang-nuts <golan...@googlegroups.com> wrote:
>
> This is an odd one. For reference, this is a customer machine, Windows Server 2016, compiled with go1.20.11. The application just hangs after a number of days; a Windows minidump reveals that most threads are doing this:
>
> Goroutine 462 - User: unicode/utf16/utf16.go:106 unicode/utf16.Decode (0xe6e391) [semacquire]
>  0  0x0000000000e2d116 in runtime.gopark at runtime/proc.go:382
>  1  0x0000000000e3df5c in runtime.goparkunlock at runtime/proc.go:387
>  2  0x0000000000e3df5c in runtime.semacquire1 at runtime/sema.go:160
>  3  0x0000000000e0ac2f in runtime.semacquire at runtime/sema.go:111
>  4  0x0000000000e0ac2f in runtime.gcMarkDone at runtime/mgc.go:787
>  5  0x0000000000128c10 in ??? at ?:-1
>  6  0x0000000000dfe7da in runtime.deductAssistCredit at runtime/malloc.go:1217
>  7  0x0000000000dfdff0 in runtime.mallocgc at runtime/malloc.go:932
>  8  0x0000000000e3f972 in runtime.makeslice at runtime/slice.go:103
>  9  0x0000000000e6e391 in unicode/utf16.Decode at unicode/utf16/utf16.go:106
> 10  0x0000000000e72a7b in syscall.UTF16ToString at syscall/syscall_windows.go:63
> 11  0x0000000000eb7a67 in os.(*File).readdir at os/dir_windows.go:43
> 12  0x0000000000eb72c5 in os.(*File).Readdirnames at os/dir.go:70
> 13  0x0000000000fb623a in path/filepath.glob at path/filepath/match.go:346
> 14  0x0000000000fb5ea5 in path/filepath.globWithLimit at path/filepath/match.go:273
> 15  0x00000000031de255 in path/filepath.Glob at path/filepath/match.go:243
>
> Multiple threads, all waiting on a semaphore inside mallocgc. The actual allocation we're waiting on changes, obviously (sometimes a string method, a hashmap allocation, a logging call...). This behavior has remained consistent across multiple hangs over a number of weeks.
>
> After a bit of digging into the minidump, this (I think) is the thread that's holding the semaphore(s), at least based on the line numbers:
>
> Goroutine 37 - User: :0 ??? (0x7ffb0e056974) (thread 5628)
>  0  0x00007ffb0e056974 in ??? at ?:-1
>  1  0x0000000000e5c5a0 in runtime.systemstack_switch at runtime/asm_amd64.s:463
>  2  0x0000000000e0ade5 in runtime.gcMarkDone at runtime/mgc.go:855
>  3  0x0000000000128c10 in ??? at ?:-1
>  4  0x0000000000e5e881 in runtime.goexit at runtime/asm_amd64.s:1598
>
> Note that the precise point in gcMarkDone can change. I have another minidump showing a hang at a different point in the same method:
>
> Goroutine 19 - Go: :0 ??? (0x208351516f8) (thread 7164) [unknown wait reason 30]
>  0  0x00007ffe66e66974 in ??? at ?:-1
>  1  0x0000000000e9c5a0 in runtime.systemstack_switch at runtime/asm_amd64.s:463
>  2  0x0000000000e4acff in runtime.gcMarkDone at runtime/mgc.go:807
>  3  0x0000000000128c10 in ??? at ?:-1
>  4  0x0000000000e9e881 in runtime.goexit at runtime/asm_amd64.s:1598

What's happening here is that a goroutine has switched to the OS thread's stack and is running there inside the runtime; it looks like a Windows minidump (unsurprisingly) can't follow that relationship. A regular goroutine stack dump, the kind emitted on a forced exit or crash, can show both sides of the system-stack switch with a little bit of configuration.

On Linux, I would debug this by setting GOTRACEBACK=crash and then killing the hung application with SIGABRT. I am not sure what the equivalent on Windows would be, but I can look into it. What GOTRACEBACK=crash does is force stack traces to be dumped on exit even for threads the runtime cannot easily preempt.
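Not from the thread, but as a hedged sketch of what an in-process fallback might look like on Windows, where there is no SIGABRT to send: raise the traceback level at startup (the programmatic equivalent of GOTRACEBACK=crash) and run a watchdog that dumps all goroutine stacks when a heartbeat that has to allocate stops making progress. The interval, buffer size, and stderr destination are placeholders, and runtime.Stack with all=true stops the world, so the dump itself can block if the runtime really is deadlocked; treat it as a best-effort complement to the minidumps.

```go
package main

import (
	"os"
	"runtime"
	"runtime/debug"
	"sync/atomic"
	"time"
)

func main() {
	// Programmatic equivalent of starting with GOTRACEBACK=crash, so that a
	// forced exit or runtime crash dumps every thread, including ones parked
	// on the system stack.
	debug.SetTraceback("crash")

	var beats atomic.Uint64

	// Heartbeat: allocates on every tick, so it wedges if mallocgc wedges.
	go func() {
		for range time.Tick(time.Second) {
			_ = make([]byte, 1<<10)
			beats.Add(1)
		}
	}()

	// Watchdog: everything it needs is allocated up front so the loop itself
	// does not have to allocate. A Windows service would likely want to write
	// the dump to a file rather than stderr.
	go func() {
		buf := make([]byte, 1<<20)
		var last uint64
		for range time.Tick(time.Minute) {
			cur := beats.Load()
			if cur == last {
				// No heartbeat for a whole minute: dump all goroutine stacks.
				// Caveat: this stops the world and may itself block if the
				// runtime is truly deadlocked.
				n := runtime.Stack(buf, true)
				os.Stderr.Write(buf[:n])
			}
			last = cur
		}
	}()

	select {} // stand-in for the real application
}
```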
> The commonality between these threads appears to be the `stacktrace()` method. Does anyone have any ideas for how to further debug this? Has anyone seen anything like this?
>
> I've never seen anything like this before, and I can't really reproduce it, since the only way to trigger it is to let the application sit and run for a few days. The only vaguely similar issue I've found is this one, which mentions interference from AV software: https://github.com/golang/go/issues/52178. Right now I'm waiting for the results of a gctrace log, as well as setting GODEBUG=asyncpreemptoff=1, but other than that, I'm a bit out of ideas.

+1, GODEBUG=asyncpreemptoff=1 is a good thing to rule out, and GODEBUG=gctrace=1 is always a good go-to for understanding what's going on. Also, would you be able to share a full stack trace of all goroutines (perhaps anonymized)? At that point, it may also be worth filing a bug at github.com/golang/go/issues/new.
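On sharing a full goroutine dump: if rebuilding the binary is an option, one common approach (my suggestion, not something already mentioned in the thread) is to expose net/http/pprof on localhost so the dump can be pulled on demand the next time the hang occurs. The port is arbitrary, and the usual caveat applies: the debug=2 dump stops the world and the handler allocates, so it can itself block if the runtime is truly wedged.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Fetch a full goroutine dump on demand with, e.g.:
	//   curl "http://localhost:6060/debug/pprof/goroutine?debug=2"
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the real application
}
```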