On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote: > On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote: > > On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote: > >> Buffered monitor output is lost when abort() is called. The pattern > >> error_report() followed by abort() occurs about 60 times, so valuable > >> information is being lost when the abort is called in the context of a > >> monitor command. > > > > I'm curious, was there a particular abort() scenario that you hit ? > > Yes, while tweaking the suspended state, and forgetting to add transitions: > > error_report("invalid runstate transition: '%s' -> '%s'", > abort(); > > But I have previously hit this for other errors. > > > For some crude statistics: > > > > $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after > > 1 error_report | grep $i | wc -l ; done > > abort: 47 > > return: 512 > > exit: 458 > > goto: 177 > > > > to me those numbers say that calling "abort()" after error_report > > should be considered a bug, and we can blanket replace all the > > abort() calls with exit(EXIT_FAILURE), and thus avoid the need to > > special case flushing the monitor. > > And presumably add an atexit handler to flush the monitor ala monitor_abort. > AFAICT currently no destructor is called for the monitor at exit time.
The HMP monitor flushes at each newline, and exit() will take care of flushing stdout, so I don't think there's anything else needed. > > Also I think there's a decent case to be made for error_report() > > to call monitor_flush(). > > A good start, but that would not help for monitors with skip_flush=true, > which > need to format the buffered string in a json response, which is the case I > tripped over. 'skip_flush' is only set to 'true' when using a QMP monitor and invoking "hmp-monitor-command". In such a case, the error message needs to be built into a JSON error reply and sent over the socket. Your patch doesn't help this case since you've just printed to stderr. I don't think it is reasonable to expect QMP monitors to send replies on SIG_ABRT anyway. So I don't think the skip_flush=true scenario is a problem to be concerned with. > >> To fix, install a SIGABRT handler to flush the monitor buffer to stderr. > >> > >> Signed-off-by: Steve Sistare <steven.sist...@oracle.com> > >> --- > >> monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++ > >> 1 file changed, 38 insertions(+) > >> > >> diff --git a/monitor/monitor.c b/monitor/monitor.c > >> index dc352f9..65dace0 100644 > >> --- a/monitor/monitor.c > >> +++ b/monitor/monitor.c > >> @@ -701,6 +701,43 @@ void monitor_cleanup(void) > >> } > >> } > >> > >> +#ifdef CONFIG_LINUX > >> + > >> +static void monitor_abort(int signal, siginfo_t *info, void *c) > >> +{ > >> + Monitor *mon = monitor_cur(); > >> + > >> + if (!mon || qemu_mutex_trylock(&mon->mon_lock)) { > >> + return; > >> + } > >> + > >> + if (mon->outbuf && mon->outbuf->len) { > >> + fputs("SIGABRT received: ", stderr); > >> + fputs(mon->outbuf->str, stderr); > >> + if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') { > >> + fputc('\n', stderr); > >> + } > >> + } > >> + > >> + qemu_mutex_unlock(&mon->mon_lock); > > > > The SIGABRT handling does not only fire in response to abort() > > calls, but also in response to bad memory scenarios, so we have > > to be careful what we do in signal handlers. > > > > In particular using mutexes in signal handlers is a big red > > flag generally. Mutex APIs are not declare async signal > > safe, so this code is technically a POSIX compliance > > violation. > > Righto. I would need to mask all signals in the sigaction to be on the > safe(r) side. This is still doomed, because SIGABRT could fire while 'mon_lock' is already held, and so this code would deadlock trying to acquire the lock. > > So I think we'd be safer just eliminating the explicit abort() > > calls and adding monitor_flush call to error_report. > > I like adding a handler because it is future proof. No need to play > whack-a-mole when > developers re-introduce abort() calls in the future. A minor benefit is I > would not > need ack's from 50 maintainers to change 50 call sites from abort to exit. That's a bit of a crazy exaggeration. THe aborts() don't cover 50 different subsystems, and we don't require explicit acks from every subsystem maintainer for trivial cleanups like this. > A slight risk of the exit solution is that something bad happened at the call > site, so > qemu state can no longer be trusted. Calling abort immediately may be safer > than calling > exit which will call the existing atexit handlers and could have side effects. If that was a real problem, then we already face it because we have ~500 places already calling exit() and only 50 calling abort(). > A third option is to define qemu_abort() which flushes the monitor, and > replaces all abort > calls. That avoids async-signal-mutex hand wringing, but is still subject to > whack-a-mole. > > So: atexit, signal handler, or qemu_abort? I will go with your preference. Just replace abort -> exit. I'm not seeing a need for an atexit handler on top. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|