On Wed, Jul 31, 2019 at 12:26 PM Harald Sitter <sit...@kde.org> wrote:
>
> Moin Moin!
>
> I've been haunting down a nasty backtrace problem in drkonqi where it
> entirely fails to create a backtrace and am now fairly confident this
> is in fact a design flaw with kcrash, but I have no awesome ideas on
> how to solve this properly.
>
> Long story short: there is a space of time between SEGV occurring and
> drkonqi stopping the threads. This causes (e.g.) GIO threads to
> actively unavoidably crash the process. Most recently this could/can
> be observed with plasmashell which has a GIO thread sitting around
> when (I think) flatpak updates are being checked. The result is that
> the crash cannot be traced because the process dies before drkonqi has
> a chance to deal with it.
>
> If you have ever seen a warning or error of the kind "XCB connection
> lost" or something similar it is in fact the very same problem, albeit
> usually not fatal.
>
> When a process crashes SEGV is sent to any one thread. The other
> threads continue to run!
> When the SEGV arrives the standard handler will possibly restart the
> process, then close all open file descriptors, potentially start (and
> wait for) drkonqi and when drkonqi has worked its magic raise itself
> to a core pattern process if applicable [1].
> The threads have still not been suspended!
> When drkonqi starts, it sends STOP to the crashed process. STOP is
> delivered to every thread, thus stopping everything this time around.
> Only now is the process "safe" from crashing while crashing.
>
> And that's the race right there. In between the file descriptors
> getting closed and the STOPping the threads that aren't being handled
> and continue to run to potentially access the now-closed file
> descriptors. In GIO's case it can try to read inotify events and run
> into an error (e.g. in ik_source_read_some_events) and g_error, which
> as far as I can tell will result in a TRAP because g_error almost
> always(?) ends in g_abort.
>
> The solution is simply: we shouldn't close FDs before all threads are stopped.
>
> Practically I can't think of a way to actually pull this off though.
> We'd need to close the FDs *at* STOP. But STOP like KILL cannot be
> handled.
>
> I think the actual solution here would need to be that kcrash stops
> invoking drkonqi and instead defers to a core handler through which
> drkonqi can get access to the core.
> Trouble is that there can only be one core handler and there are more
> software providers on a system than just us, so I guess this isn't
> really a viable solution :/
> Also the core stuff isn't too portable I think.
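To make the race above concrete, here is a minimal sketch of it in Python. It is not kcrash or GIO code: the pipe stands in for the inotify descriptor, the worker stands in for a GIO thread, and the function name simulate_fd_race is made up for illustration. The only point is that a thread which nobody has stopped yet gets an error as soon as it touches a descriptor the handler has already closed.

```python
import errno
import os
import threading


def simulate_fd_race():
    """Close a descriptor while another thread is still alive, then let that
    thread use it. Returns the errno the worker saw, or None if the read
    unexpectedly succeeded."""
    read_fd, write_fd = os.pipe()
    fds_closed = threading.Event()
    result = {}

    def worker():
        # Stand-in for a helper thread (GIO in the report): it is not the
        # thread that received the signal, so nothing has stopped it yet.
        fds_closed.wait()
        try:
            os.read(read_fd, 1)
            result["errno"] = None
        except OSError as exc:
            result["errno"] = exc.errno

    thread = threading.Thread(target=worker)
    thread.start()

    # Stand-in for the crash handler: descriptors are closed before any
    # SIGSTOP has been sent to the process.
    os.close(read_fd)
    os.close(write_fd)
    fds_closed.set()

    thread.join()
    return result["errno"]


if __name__ == "__main__":
    seen = simulate_fd_race()
    print(errno.errorcode.get(seen, seen))
```

In the sketch the worker just records the error. In the reported case the equivalent error ends up in g_error, which is what takes the process down before drkonqi can attach.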
Well, yes. It's a complex issue, as we're dealing with a dying process.

My impression is that relying on core handlers makes a lot of sense, though
there would be some questions to answer, such as "what happens when running
on other systems?".

Maybe for now we could try an in-between: handle cores on Plasma, and keep
using drkonqi as we do now everywhere else?

Does drkonqi work at all nowadays on systems that aren't Linux/BSD?

Aleix
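PS: a rough sketch of what such an in-between could look like, purely as an illustration. It assumes Linux, where /proc/sys/kernel/core_pattern starts with "|" when cores are piped to a handler program. The function names, the policy itself, and the use of systemd-coredump as the expected handler are all assumptions on my side, not something kcrash does today.

```python
from pathlib import Path


def parse_core_pattern(pattern):
    """Return ("pipe", command) if cores are piped to a handler program,
    otherwise ("file", template)."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        return ("pipe", pattern[1:].strip())
    return ("file", pattern)


def choose_crash_path(pattern, handler_name):
    """Hypothetical policy: defer to the core handler only when the installed
    handler is the one we expect, otherwise keep the current drkonqi path."""
    kind, value = parse_core_pattern(pattern)
    words = value.split()
    if kind == "pipe" and words and Path(words[0]).name == handler_name:
        return "core-handler"
    return "drkonqi"


if __name__ == "__main__":
    proc_file = Path("/proc/sys/kernel/core_pattern")  # Linux only
    if proc_file.exists():
        # "systemd-coredump" is an assumed handler name for this example.
        print(choose_crash_path(proc_file.read_text(), "systemd-coredump"))
    else:
        print("drkonqi")
```

Anything that isn't the expected setup falls back to what we do now, which also covers the non-Linux case.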