Thanks for the response Michael.

> It looks like a Windows minidump (unsurprisingly) can't follow the relation. A regular goroutine stack trace dump that gets emitted on a forced exit or crash is able to show both sides of the system stack switch with a little bit of configuration.

Yeah, that was my understanding too. Again, not a Windows expert, but my understanding is that there's no equivalent to the "send SIGABRT to get a stack dump" trick on Windows?
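(The closest signal-free substitute I know of is to let the process serve its own goroutine dump via net/http/pprof. A minimal sketch, not something we've tried on this machine; the port is arbitrary, the endpoint is the package default, and if the runtime really is wedged in the GC, the handler's own allocations may hang too:)

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
    )

    func main() {
        // While the process is hung, fetch all goroutine stacks with:
        //   curl "http://localhost:6060/debug/pprof/goroutine?debug=2"
        // debug=2 prints each goroutine in the same format as a panic traceback.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()
        select {} // stand-in for the real application
    }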
I didn't file a bug initially, as this version appears to be more than two minor releases behind, and I also wasn't completely sure if it was a bug or some weird AV/VM behavior. The entire stack trace is pretty huge, so I put it here: https://gist.github.com/fearful-symmetry/1e6546640f667731b3fe2a90f96b5680

On Wednesday, May 29, 2024 at 9:26:50 PM UTC-7 Michael Knyszek wrote:

> The example goroutine in the original post is parked waiting for some other goroutine to finish up the GC cycle. Somewhere, a goroutine is getting stuck trying to finish it up, which could possibly be a deadlock. (I am especially suspicious of a deadlock bug because most threads are stopped there.) It would be good to confirm whether application CPU time drops to zero or plateaus at some very consistent number, which would support the deadlock theory.
>
> On Wednesday, May 29, 2024 at 11:09:30 AM UTC-4 robert engels wrote:
>
>> I am pretty sure the runtime is supposed to crash the process if it slows the allocators "too much" (I believe there are some config settings to control this).
>
> The runtime never intentionally crashes a process due to rate or volume of allocation. There is an optional memory limit, but it's soft in that the runtime gives up trying to maintain the limit at some point; see below and see https://go.dev/doc/gc-guide.
>
>> If you have enough goroutines it may look like they are hung - you need to track specific routines by their ID. The stack certainly looks like it is trying to allocate memory for a slice - and it is being paused - because it is waiting on the GC lock.
>>
>> Are you doing any native code? You could have heap corruption, and that is also why the GC thread is getting hung.
>>
>> If the GC thread makes progress, albeit slowly, I would settle on my first guess - memory leak...
>
> I suspect a memory leak is not at play here. In the case of a memory leak, by default, your live heap may steadily increase and the time between GC cycles increases proportionally. The GC is designed such that its overall impact stays roughly constant as your heap grows. In this case, a memory leak should appear exactly as expected: a rise in memory use over time.
>
> If GOMEMLIMIT is in use (which is non-default; you have to ask for it), then a memory leak may result in progressively more frequent GC cycles. However, the worst-case failure to make progress ("GC death spiral") should be prevented in practice by the runtime already. It puts a hard limit on the amount of CPU time that can be taken by the GC every CPU-second, letting memory use increase instead. Don't get me wrong, it can still slow down the application a good bit, and maybe under some circumstances it'll still look like a hang. It is also possible there's a bug in the GC CPU limiter or something.
>
> But, from what I've seen so far in this bug, I'm more convinced of a deadlock bug than of the application failing to make progress. Lots of goroutines lining up on gcMarkDone just sounds too specific to me.
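(For anyone following along: the soft limit Michael describes can also be set programmatically. A minimal sketch, assuming Go 1.19+; the 4 GiB figure is just an example:)

    package main

    import (
        "fmt"
        "runtime/debug"
    )

    func main() {
        // Same effect as starting the process with GOMEMLIMIT=4GiB. The limit
        // is soft: as the heap approaches it the GC runs more often, but the
        // GC CPU limiter caps how much CPU the GC may consume, and past that
        // point the runtime lets memory use exceed the limit rather than
        // live-lock the program.
        prev := debug.SetMemoryLimit(4 << 30) // 4 GiB; returns the old limit
        fmt.Printf("previous memory limit: %d bytes\n", prev)
    }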
>> On May 29, 2024, at 10:00 AM, 'Alex Kristiansen' via golang-nuts <golan...@googlegroups.com> wrote:
>>
>> That's an interesting idea, I probably wouldn't have thought of that on my own. Is that expected behavior for memory pressure on Windows+golang? I don't have much Windows experience, so my assumption would be that the Windows equivalent of the OOM killer would kick in and just kill the application.
>>
>> On Tuesday, May 28, 2024 at 4:04:15 PM UTC-7 robert engels wrote:
>>
>> Feels like a memory leak to me. I would look for growing heap size in the gc logs. I am guessing that the system is not completely hung - but rather the runtime is having a hard time obtaining more memory, so it is slowing the allocators to a rate that makes them appear hung.
>>
>> It may be that the process has consumed nearly all of the OS memory too - so the OS is having a hard time responding to malloc requests.
>>
>> i.e. The system is not making progress.
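(For the "growing heap" check specifically, here's the kind of periodic logging that can run alongside gctrace - a minimal sketch; the field names come from runtime.MemStats, and the one-minute interval is arbitrary. A leak would show up as heap_alloc climbing steadily across days:)

    package main

    import (
        "log"
        "runtime"
        "time"
    )

    func main() {
        go func() {
            var m runtime.MemStats
            for range time.Tick(time.Minute) {
                // ReadMemStats briefly stops the world, so keep the interval coarse.
                runtime.ReadMemStats(&m)
                log.Printf("heap_alloc=%dMiB heap_sys=%dMiB num_gc=%d",
                    m.HeapAlloc>>20, m.HeapSys>>20, m.NumGC)
            }
        }()
        select {} // stand-in for the real application
    }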
>> On May 28, 2024, at 3:54 PM, 'Alex Kristiansen' via golang-nuts <golan...@googlegroups.com> wrote:
>>
>> This is an odd one. For reference, this is a customer machine, Windows Server 2016, compiled with go1.20.11. The application just hangs after a number of days; a Windows minidump reveals that most threads are doing this:
>>
>> Goroutine 462 - User: unicode/utf16/utf16.go:106 unicode/utf16.Decode (0xe6e391) [semacquire]
>>   0  0x0000000000e2d116 in runtime.gopark
>>      at runtime/proc.go:382
>>   1  0x0000000000e3df5c in runtime.goparkunlock
>>      at runtime/proc.go:387
>>   2  0x0000000000e3df5c in runtime.semacquire1
>>      at runtime/sema.go:160
>>   3  0x0000000000e0ac2f in runtime.semacquire
>>      at runtime/sema.go:111
>>   4  0x0000000000e0ac2f in runtime.gcMarkDone
>>      at runtime/mgc.go:787
>>   5  0x0000000000128c10 in ???
>>      at ?:-1
>>   6  0x0000000000dfe7da in runtime.deductAssistCredit
>>      at runtime/malloc.go:1217
>>   7  0x0000000000dfdff0 in runtime.mallocgc
>>      at runtime/malloc.go:932
>>   8  0x0000000000e3f972 in runtime.makeslice
>>      at runtime/slice.go:103
>>   9  0x0000000000e6e391 in unicode/utf16.Decode
>>      at unicode/utf16/utf16.go:106
>>  10  0x0000000000e72a7b in syscall.UTF16ToString
>>      at syscall/syscall_windows.go:63
>>  11  0x0000000000eb7a67 in os.(*File).readdir
>>      at os/dir_windows.go:43
>>  12  0x0000000000eb72c5 in os.(*File).Readdirnames
>>      at os/dir.go:70
>>  13  0x0000000000fb623a in path/filepath.glob
>>      at path/filepath/match.go:346
>>  14  0x0000000000fb5ea5 in path/filepath.globWithLimit
>>      at path/filepath/match.go:273
>>  15  0x00000000031de255 in path/filepath.Glob
>>      at path/filepath/match.go:243
>>
>> Multiple threads, all waiting on a semaphore inside mallocgc. The actual reason we're waiting for memory changes, obviously (sometimes a string method, or hashmap allocation, or logging call...). This behavior has remained consistent across multiple hangs over a number of weeks. After a bit of digging into the minidump, this (I think) is the thread that's holding the semaphore(s), at least based on the line numbers:
>>
>> Goroutine 37 - User: :0 ??? (0x7ffb0e056974) (thread 5628)
>>   0  0x00007ffb0e056974 in ???
>>      at ?:-1
>>   1  0x0000000000e5c5a0 in runtime.systemstack_switch
>>      at runtime/asm_amd64.s:463
>>   2  0x0000000000e0ade5 in runtime.gcMarkDone
>>      at runtime/mgc.go:855
>>   3  0x0000000000128c10 in ???
>>      at ?:-1
>>   4  0x0000000000e5e881 in runtime.goexit
>>      at runtime/asm_amd64.s:1598
>>
>> Note that the precise point in gcMarkDone can change. I have another minidump showing a hang at a different point in the same method:
>>
>> Goroutine 19 - Go: :0 ??? (0x208351516f8) (thread 7164) [unknown wait reason 30]
>>   0  0x00007ffe66e66974 in ???
>>      at ?:-1
>>   1  0x0000000000e9c5a0 in runtime.systemstack_switch
>>      at runtime/asm_amd64.s:463
>>   2  0x0000000000e4acff in runtime.gcMarkDone
>>      at runtime/mgc.go:807
>>   3  0x0000000000128c10 in ???
>>      at ?:-1
>>   4  0x0000000000e9e881 in runtime.goexit
>>      at runtime/asm_amd64.s:1598

> What's happening here is that a goroutine has switched to the OS thread stack and is running there in the runtime. It looks like a Windows minidump (unsurprisingly) can't follow the relation. A regular goroutine stack trace dump that gets emitted on a forced exit or crash is able to show both sides of the system stack switch with a little bit of configuration.
>
> On Linux, I would debug this by setting GOTRACEBACK=crash and then killing the hung application with a SIGABRT. I am not sure what the equivalent on Windows would be, but I can look into it. What GOTRACEBACK=crash does is cause threads that the runtime cannot easily preempt to have their stack traces forcibly dumped on exit.

>> The commonality between these threads appears to be the `stacktrace()` method.
>>
>> Does anyone have any ideas for how to further debug this? Has anyone seen anything like this? I've never seen anything like it before, and I can't really reproduce it, as the only behavior is "let the application sit and run for a few days." The only vaguely similar issue I've found is this one, which mentions interference from AV software: https://github.com/golang/go/issues/52178.
>>
>> Right now I'm waiting for the results of a gctrace log, as well as setting GODEBUG=asyncpreemptoff=1, but other than that, I'm a bit out of ideas.

> +1, GODEBUG=asyncpreemptoff=1 is good to rule out, and GODEBUG=gctrace=1 is always a good go-to for understanding what's going on.
>
> Also, would you be able to share a full stack trace of all goroutines (perhaps anonymized)? At that point, it may also be worth filing a bug at github.com/golang/go/issues/new.
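(And on Michael's suggestion of confirming whether CPU time drops to zero or plateaus: a minimal sketch of a GC-progress heartbeat, using runtime/metrics names as documented for Go 1.20 - worth double-checking against metrics.All() on the exact toolchain. If GC cycles stop completing, gc_cycles stops advancing; and since this goroutine allocates when it logs, the heartbeat going silent while the process is still alive would itself support the deadlock theory:)

    package main

    import (
        "log"
        "runtime/metrics"
        "time"
    )

    func main() {
        go func() {
            samples := []metrics.Sample{
                {Name: "/gc/cycles/total:gc-cycles"},        // completed GC cycles
                {Name: "/cpu/classes/gc/total:cpu-seconds"}, // estimated GC CPU time
                {Name: "/cpu/classes/total:cpu-seconds"},    // estimated total CPU time
            }
            for range time.Tick(30 * time.Second) {
                metrics.Read(samples)
                log.Printf("gc_cycles=%d gc_cpu=%.1fs total_cpu=%.1fs",
                    samples[0].Value.Uint64(),
                    samples[1].Value.Float64(),
                    samples[2].Value.Float64())
            }
        }()
        select {} // stand-in for the real application
    }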