Kevin Wolf <kw...@redhat.com> writes: > Am 29.01.2021 um 13:53 hat Markus Armbruster geschrieben: >> I ran into odd behavior the other day, and bisected it to this commit. >> >> $ qemu-system-x86_64 -display none -chardev >> socket,id=qmp,path=test-qmp,server=on,wait=off -mon mode=control,chardev=qmp >> >> In another terminal, create a bunch of FIFOs, then use them to have some >> in-band commands block, with out-of-band commands interleaved just >> because: >> >> $ for ((i=0; i<20; i++)); do mkfifo fifo$i; done >> $ cat oob-test2 >> {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}} >> {"exec-oob": "migrate-pause", "id": 0} >> {"execute": "memsave", "id": 1, "arguments": {"val": 0, "size": 4096, >> "filename": "fifo1"}} >> {"exec-oob": "migrate-pause", "id": 2} >> {"execute": "memsave", "id": 3, "arguments": {"val": 0, "size": 4096, >> "filename": "fifo3"}} >> {"exec-oob": "migrate-pause", "id": 4} >> {"execute": "memsave", "id": 5, "arguments": {"val": 0, "size": 4096, >> "filename": "fifo5"}} >> {"exec-oob": "migrate-pause", "id": 6} >> {"execute": "memsave", "id": 7, "arguments": {"val": 0, "size": 4096, >> "filename": "fifo7"}} >> {"exec-oob": "migrate-pause", "id": 8} >> {"execute": "memsave", "id": 9, "arguments": {"val": 0, "size": 4096, >> "filename": "fifo9"}} >> {"exec-oob": "migrate-pause", "id": 10} >> {"execute": "memsave", "id": 11, "arguments": {"val": 0, "size": 4096, >> "filename": "fifo11"}} >> {"exec-oob": "migrate-pause", "id": 12} >> {"execute": "memsave", "id": 13, "arguments": {"val": 0, "size": 4096, >> "filename": "fifo13"}} >> {"exec-oob": "migrate-pause", "id": 14} >> {"execute": "memsave", "id": 15, "arguments": {"val": 0, "size": 4096, >> "filename": "fifo15"}} >> {"exec-oob": "migrate-pause", "id": 16} >> {"execute": "memsave", "id": 17, "arguments": {"val": 0, "size": 4096, >> "filename": "fifo17"}} >> {"exec-oob": "migrate-pause", "id": 18} >> {"execute": "memsave", "id": 19, "arguments": {"val": 0, "size": 4096, >> "filename": "fifo19"}} >> {"exec-oob": "migrate-pause", "id": 20} >> $ socat -t99999 STDIO UNIX-CONNECT:$HOME/work/images/test-qmp <oob-test2 >> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 5}, >> "package": "v5.1.0-2224-g8db1efd3f3"}, "capabilities": ["oob"]}} >> {"return": {}} >> {"id": 0, "error": {"class": "GenericError", "desc": "migrate-pause is >> currently only supported during postcopy-active state"}} >> {"id": 2, "error": {"class": "GenericError", "desc": "migrate-pause is >> currently only supported during postcopy-active state"}} >> {"id": 4, "error": {"class": "GenericError", "desc": "migrate-pause is >> currently only supported during postcopy-active state"}} >> {"id": 6, "error": {"class": "GenericError", "desc": "migrate-pause is >> currently only supported during postcopy-active state"}} >> {"id": 8, "error": {"class": "GenericError", "desc": "migrate-pause is >> currently only supported during postcopy-active state"}} >> {"id": 10, "error": {"class": "GenericError", "desc": "migrate-pause is >> currently only supported during postcopy-active state"}} >> {"id": 12, "error": {"class": "GenericError", "desc": "migrate-pause is >> currently only supported during postcopy-active state"}} >> {"id": 14, "error": {"class": "GenericError", "desc": "migrate-pause is >> currently only supported during postcopy-active state"}} >> {"id": 16, "error": {"class": "GenericError", "desc": "migrate-pause is >> currently only supported during postcopy-active state"}} >> >> Looking good: the out-of-band commands jump the queue until the queue is >> too full for jumping. >> >> Now go back to the first terminal, and hit C-c. >> >> Before this commit, the second terminal shows the shutdown event >> >> {"timestamp": {"seconds": 1611923623, "microseconds": 528169}, "event": >> "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}} >> >> and the first terminal shows >> >> ^Cqemu-system-x86_64: terminating on signal 2 >> >> QEMU terminates with exit status 0. Good (except for the exit status, >> but let's ignore that). >> >> After the commit, the second terminal additionally shows the error reply >> for (in-band) command 1 >> >> {"id": 1, "error": {"class": "GenericError", "desc": "Could not open >> 'fifo1': Interrupted system call"}} >> {"timestamp": {"seconds": 1611923812, "microseconds": 520891}, "event": >> "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}} >> >> and the first terminal still shows >> >> ^Cqemu-system-x86_64: terminating on signal 2 >> >> However, QEMU does *not* terminate. When I hit C-c again, the second >> terminal gives me the next in-band reply >> >> {"id": 3, "error": {"class": "GenericError", "desc": "Could not open >> 'fifo3': Interrupted system call"}} >> >> and the first one another >> >> ^C >> >> Hitting C-c some more gives me more in-band replies and more ^C. > > The problem is that monitor_qmp_dispatcher_co() doesn't check whether it > should shut down unless it would have to wait for a new request. > > So 'memsave' tries to open the FIFO, this blocks and ^C results in EINTR > for the open(), which makes the 'memsave' command fail.
Perhaps the command should retry after EINTR. Out of scope here. > Then > monitor_qmp_dispatcher_co() tries to execute the rest of the queued > commands, i.e. the next 'memsave' that will hang in the same way. > > Fixing this is easy enough: > > diff --git a/monitor/qmp.c b/monitor/qmp.c > index 43880fa623..46939537b4 100644 > --- a/monitor/qmp.c > +++ b/monitor/qmp.c > @@ -227,6 +227,10 @@ void coroutine_fn monitor_qmp_dispatcher_co(void *data) > */ > qatomic_mb_set(&qmp_dispatcher_co_busy, false); > > + if (qmp_dispatcher_co_shutdown) { > + return; > + } > + > while (!(req_obj = monitor_qmp_requests_pop_any_with_lock())) { > /* > * No more requests to process. Wait to be reentered from > >> The ninth C-c gives me the error reply for (in-band) command 17, and a crash: >> >> Terminal 1 now shows >> >> ^Cqemu-system-x86_64-qemu: terminating on signal 2 >> ^C^C^C^C^C^C^C^Cqemu-system-x86_64-qemu: ../util/async.c:343: >> aio_ctx_finalize: Assertion `flags & BH_DELETED' failed. >> Aborted (core dumped) > > So all of this happens inside of monitor_cleanup(), while waiting for > monitor_qmp_dispatcher_co() to shut down: > > AIO_WAIT_WHILE(qemu_get_aio_context(), > (aio_poll(iohandler_get_aio_context(), false), > qatomic_mb_read(&qmp_dispatcher_co_busy))); > > Importantly, this is _after_ calling iothread_stop(), which made sure > that all pending BHs in the monitor iothread are executed. > > What now happens is that monitor_qmp_dispatcher_co() wants to resume > the monitor. This schedules a new BH on the iothread, which was already > supposed to be inactive. > > When finally all requests are handled and monitor_cleanup() continues > after the polling loop and calls iothread_destroy(), we notice that > there is a pending BH where there shouldn't be any and abort. > > I think this means that the commit should have moved even the > iothread_stop() call to below the polling loop. I can't reproduce the > problem any more with the fix above, but I think the current order in > monitor_cleanup() is still a (possibly latent) bug. > > So why did all of that work before 357bda95? > > If the old code didn't crash as described in the commit message, it > would just free all kinds of monitor resources while the coroutine was > still running. This includes removing all pending requests from the > queue. I guess this is what accidentally made it "work" previously. Awesome. Thanks for writing up your analysis!