Daniel P. Berrangé <berra...@redhat.com> writes: > On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote: >> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote: >> > On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote: >> >> Buffered monitor output is lost when abort() is called. The pattern >> >> error_report() followed by abort() occurs about 60 times, so valuable >> >> information is being lost when the abort is called in the context of a >> >> monitor command. >> > >> > I'm curious, was there a particular abort() scenario that you hit ? >> >> Yes, while tweaking the suspended state, and forgetting to add transitions: >> >> error_report("invalid runstate transition: '%s' -> '%s'", >> abort(); >> >> But I have previously hit this for other errors.
Can you provide a reproducer? >> > For some crude statistics: >> > >> > $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after >> > 1 error_report | grep $i | wc -l ; done >> > abort: 47 >> > return: 512 >> > exit: 458 >> > goto: 177 >> > >> > to me those numbers say that calling "abort()" after error_report >> > should be considered a bug, and we can blanket replace all the >> > abort() calls with exit(EXIT_FAILURE), and thus avoid the need to >> > special case flushing the monitor. >> >> And presumably add an atexit handler to flush the monitor ala monitor_abort. >> AFAICT currently no destructor is called for the monitor at exit time. > > The HMP monitor flushes at each newline, and exit() will take care of > flushing stdout, so I don't think there's anything else needed. Correct. >> > Also I think there's a decent case to be made for error_report() >> > to call monitor_flush(). No, because the messages printed by error_report() all end in newline, and printing a newline to a monitor triggers monitor_flush_locked(). >> A good start, but that would not help for monitors with skip_flush=true, >> which >> need to format the buffered string in a json response, which is the case I >> tripped over. > > 'skip_flush' is only set to 'true' when using a QMP monitor and invoking > "hmp-monitor-command". Correct. > In such a case, the error message needs to be built into a JSON error > reply and sent over the socket. Your patch doesn't help this case > since you've just printed to stderr. I don't think it is reasonable > to expect QMP monitors to send replies on SIG_ABRT anyway. So I don't > think the skip_flush=true scenario is a problem to be concerned with. > >> >> To fix, install a SIGABRT handler to flush the monitor buffer to stderr. >> >> >> >> Signed-off-by: Steve Sistare <steven.sist...@oracle.com> >> >> --- >> >> monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++ >> >> 1 file changed, 38 insertions(+) >> >> >> >> diff --git a/monitor/monitor.c b/monitor/monitor.c >> >> index dc352f9..65dace0 100644 >> >> --- a/monitor/monitor.c >> >> +++ b/monitor/monitor.c >> >> @@ -701,6 +701,43 @@ void monitor_cleanup(void) >> >> } >> >> } >> >> >> >> +#ifdef CONFIG_LINUX >> >> + >> >> +static void monitor_abort(int signal, siginfo_t *info, void *c) >> >> +{ >> >> + Monitor *mon = monitor_cur(); >> >> + >> >> + if (!mon || qemu_mutex_trylock(&mon->mon_lock)) { >> >> + return; >> >> + } >> >> + >> >> + if (mon->outbuf && mon->outbuf->len) { >> >> + fputs("SIGABRT received: ", stderr); >> >> + fputs(mon->outbuf->str, stderr); >> >> + if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') { >> >> + fputc('\n', stderr); >> >> + } >> >> + } >> >> + >> >> + qemu_mutex_unlock(&mon->mon_lock); >> > >> > The SIGABRT handling does not only fire in response to abort() >> > calls, but also in response to bad memory scenarios, so we have >> > to be careful what we do in signal handlers. >> > >> > In particular using mutexes in signal handlers is a big red >> > flag generally. Mutex APIs are not declare async signal >> > safe, so this code is technically a POSIX compliance >> > violation. "Technically a POSIX compliance violation" sounds like something only pedants would care about. It's actually a recipe for deadlocks and crashes. >> Righto. I would need to mask all signals in the sigaction to be on the >> safe(r) side. > > This is still doomed, because SIGABRT could fire while 'mon_lock' is > already held, and so this code would deadlock trying to acquire the > lock. Yup. There is no way to make async signal unsafe code safe. >> > So I think we'd be safer just eliminating the explicit abort() >> > calls and adding monitor_flush call to error_report. >> >> I like adding a handler because it is future proof. No need to play >> whack-a-mole when >> developers re-introduce abort() calls in the future. A minor benefit is I >> would not >> need ack's from 50 maintainers to change 50 call sites from abort to exit. > > That's a bit of a crazy exaggeration. THe aborts() don't cover 50 different > subsystems, and we don't require explicit acks from every subsystem maintainer > for trivial cleanups like this. > >> A slight risk of the exit solution is that something bad happened at the >> call site, so >> qemu state can no longer be trusted. Calling abort immediately may be safer >> than calling >> exit which will call the existing atexit handlers and could have side >> effects. > > If that was a real problem, then we already face it because we have > ~500 places already calling exit() and only 50 calling abort(). > >> A third option is to define qemu_abort() which flushes the monitor, and >> replaces all abort >> calls. That avoids async-signal-mutex hand wringing, but is still subject >> to whack-a-mole. >> >> So: atexit, signal handler, or qemu_abort? I will go with your preference. > > Just replace abort -> exit. I disagree. Use exit(1) for fatal errors. Use abort() for programming errors (a.k.a. bugs). > I'm not seeing a need for an atexit handler on top. I'm not yet seeing a need for anything. A reproducer may change that.