On 11/3/2023 3:51 PM, Steven Sistare wrote: > On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote: >> On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote: >>> Buffered monitor output is lost when abort() is called. The pattern >>> error_report() followed by abort() occurs about 60 times, so valuable >>> information is being lost when the abort is called in the context of a >>> monitor command. >> >> I'm curious, was there a particular abort() scenario that you hit ? > > Yes, while tweaking the suspended state, and forgetting to add transitions: > > error_report("invalid runstate transition: '%s' -> '%s'", > abort(); > > But I have previously hit this for other errors. > >> For some crude statistics: >> >> $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after 1 >> error_report | grep $i | wc -l ; done >> abort: 47 >> return: 512 >> exit: 458 >> goto: 177 >> >> to me those numbers say that calling "abort()" after error_report >> should be considered a bug, and we can blanket replace all the >> abort() calls with exit(EXIT_FAILURE), and thus avoid the need to >> special case flushing the monitor. > > And presumably add an atexit handler to flush the monitor ala monitor_abort. > AFAICT currently no destructor is called for the monitor at exit time. > >> Also I think there's a decent case to be made for error_report() >> to call monitor_flush(). > > A good start, but that would not help for monitors with skip_flush=true, > which > need to format the buffered string in a json response, which is the case I > tripped over. > >>> To fix, install a SIGABRT handler to flush the monitor buffer to stderr. >>> >>> Signed-off-by: Steve Sistare <steven.sist...@oracle.com> >>> --- >>> monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++ >>> 1 file changed, 38 insertions(+) >>> >>> diff --git a/monitor/monitor.c b/monitor/monitor.c >>> index dc352f9..65dace0 100644 >>> --- a/monitor/monitor.c >>> +++ b/monitor/monitor.c >>> @@ -701,6 +701,43 @@ void monitor_cleanup(void) >>> } >>> } >>> >>> +#ifdef CONFIG_LINUX >>> + >>> +static void monitor_abort(int signal, siginfo_t *info, void *c) >>> +{ >>> + Monitor *mon = monitor_cur(); >>> + >>> + if (!mon || qemu_mutex_trylock(&mon->mon_lock)) { >>> + return; >>> + } >>> + >>> + if (mon->outbuf && mon->outbuf->len) { >>> + fputs("SIGABRT received: ", stderr); >>> + fputs(mon->outbuf->str, stderr); >>> + if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') { >>> + fputc('\n', stderr); >>> + } >>> + } >>> + >>> + qemu_mutex_unlock(&mon->mon_lock); >> >> The SIGABRT handling does not only fire in response to abort() >> calls, but also in response to bad memory scenarios, so we have >> to be careful what we do in signal handlers. >> >> In particular using mutexes in signal handlers is a big red >> flag generally. Mutex APIs are not declare async signal >> safe, so this code is technically a POSIX compliance >> violation. > > Righto. I would need to mask all signals in the sigaction to be on the > safe(r) side. > >> So I think we'd be safer just eliminating the explicit abort() >> calls and adding monitor_flush call to error_report. > > I like adding a handler because it is future proof. No need to play > whack-a-mole when > developers re-introduce abort() calls in the future. A minor benefit is I > would not > need ack's from 50 maintainers to change 50 call sites from abort to exit. > > A slight risk of the exit solution is that something bad happened at the call > site, so > qemu state can no longer be trusted. Calling abort immediately may be safer > than calling > exit which will call the existing atexit handlers and could have side effects. > > A third option is to define qemu_abort() which flushes the monitor, and > replaces all abort > calls. That avoids async-signal-mutex hand wringing, but is still subject to > whack-a-mole. > > So: atexit, signal handler, or qemu_abort? I will go with your preference.
If I go with signal handler, I would also add atexit to flush the existing call sites of error_report() + exit(). One more tidbit: the signal handler would print pending messages if any of the 11000 assert() calls fails, though having a message present is less likely in this case. - Steve >>> +} >>> + >>> +static void monitor_add_abort_handler(void) >>> +{ >>> + struct sigaction act; >>> + >>> + memset(&act, 0, sizeof(act)); >>> + act.sa_sigaction = monitor_abort; >>> + act.sa_flags = SA_SIGINFO; >>> + sigaction(SIGABRT, &act, NULL); >>> +} >>> + >>> +#else >>> + >>> +static void monitor_add_abort_handler(void) {} >>> + >>> +#endif >>> + >>> static void monitor_qapi_event_init(void) >>> { >>> monitor_qapi_event_state = g_hash_table_new(qapi_event_throttle_hash, >>> @@ -712,6 +749,7 @@ void monitor_init_globals(void) >>> monitor_qapi_event_init(); >>> qemu_mutex_init(&monitor_lock); >>> coroutine_mon = g_hash_table_new(NULL, NULL); >>> + monitor_add_abort_handler(); >>> >>> /* >>> * The dispatcher BH must run in the main loop thread, since we >>> -- >>> 1.8.3.1 >>> >>> >> >> With regards, >> Daniel