Am 29.01.2021 um 13:53 hat Markus Armbruster geschrieben: > I ran into odd behavior the other day, and bisected it to this commit. > > $ qemu-system-x86_64 -display none -chardev > socket,id=qmp,path=test-qmp,server=on,wait=off -mon mode=control,chardev=qmp > > In another terminal, create a bunch of FIFOs, then use them to have some > in-band commands block, with out-of-band commands interleaved just > because: > > $ for ((i=0; i<20; i++)); do mkfifo fifo$i; done > $ cat oob-test2 > {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}} > {"exec-oob": "migrate-pause", "id": 0} > {"execute": "memsave", "id": 1, "arguments": {"val": 0, "size": 4096, > "filename": "fifo1"}} > {"exec-oob": "migrate-pause", "id": 2} > {"execute": "memsave", "id": 3, "arguments": {"val": 0, "size": 4096, > "filename": "fifo3"}} > {"exec-oob": "migrate-pause", "id": 4} > {"execute": "memsave", "id": 5, "arguments": {"val": 0, "size": 4096, > "filename": "fifo5"}} > {"exec-oob": "migrate-pause", "id": 6} > {"execute": "memsave", "id": 7, "arguments": {"val": 0, "size": 4096, > "filename": "fifo7"}} > {"exec-oob": "migrate-pause", "id": 8} > {"execute": "memsave", "id": 9, "arguments": {"val": 0, "size": 4096, > "filename": "fifo9"}} > {"exec-oob": "migrate-pause", "id": 10} > {"execute": "memsave", "id": 11, "arguments": {"val": 0, "size": 4096, > "filename": "fifo11"}} > {"exec-oob": "migrate-pause", "id": 12} > {"execute": "memsave", "id": 13, "arguments": {"val": 0, "size": 4096, > "filename": "fifo13"}} > {"exec-oob": "migrate-pause", "id": 14} > {"execute": "memsave", "id": 15, "arguments": {"val": 0, "size": 4096, > "filename": "fifo15"}} > {"exec-oob": "migrate-pause", "id": 16} > {"execute": "memsave", "id": 17, "arguments": {"val": 0, "size": 4096, > "filename": "fifo17"}} > {"exec-oob": "migrate-pause", "id": 18} > {"execute": "memsave", "id": 19, "arguments": {"val": 0, "size": 4096, > "filename": "fifo19"}} > {"exec-oob": "migrate-pause", "id": 20} > $ socat -t99999 STDIO UNIX-CONNECT:$HOME/work/images/test-qmp <oob-test2 > {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 5}, > "package": "v5.1.0-2224-g8db1efd3f3"}, "capabilities": ["oob"]}} > {"return": {}} > {"id": 0, "error": {"class": "GenericError", "desc": "migrate-pause is > currently only supported during postcopy-active state"}} > {"id": 2, "error": {"class": "GenericError", "desc": "migrate-pause is > currently only supported during postcopy-active state"}} > {"id": 4, "error": {"class": "GenericError", "desc": "migrate-pause is > currently only supported during postcopy-active state"}} > {"id": 6, "error": {"class": "GenericError", "desc": "migrate-pause is > currently only supported during postcopy-active state"}} > {"id": 8, "error": {"class": "GenericError", "desc": "migrate-pause is > currently only supported during postcopy-active state"}} > {"id": 10, "error": {"class": "GenericError", "desc": "migrate-pause is > currently only supported during postcopy-active state"}} > {"id": 12, "error": {"class": "GenericError", "desc": "migrate-pause is > currently only supported during postcopy-active state"}} > {"id": 14, "error": {"class": "GenericError", "desc": "migrate-pause is > currently only supported during postcopy-active state"}} > {"id": 16, "error": {"class": "GenericError", "desc": "migrate-pause is > currently only supported during postcopy-active state"}} > > Looking good: the out-of-band commands jump the queue until the queue is > too full for jumping. > > Now go back to the first terminal, and hit C-c. > > Before this commit, the second terminal shows the shutdown event > > {"timestamp": {"seconds": 1611923623, "microseconds": 528169}, "event": > "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}} > > and the first terminal shows > > ^Cqemu-system-x86_64: terminating on signal 2 > > QEMU terminates with exit status 0. Good (except for the exit status, > but let's ignore that). > > After the commit, the second terminal additionally shows the error reply > for (in-band) command 1 > > {"id": 1, "error": {"class": "GenericError", "desc": "Could not open > 'fifo1': Interrupted system call"}} > {"timestamp": {"seconds": 1611923812, "microseconds": 520891}, "event": > "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}} > > and the first terminal still shows > > ^Cqemu-system-x86_64: terminating on signal 2 > > However, QEMU does *not* terminate. When I hit C-c again, the second > terminal gives me the next in-band reply > > {"id": 3, "error": {"class": "GenericError", "desc": "Could not open > 'fifo3': Interrupted system call"}} > > and the first one another > > ^C > > Hitting C-c some more gives me more in-band replies and more ^C.
The problem is that monitor_qmp_dispatcher_co() doesn't check whether it should shut down unless it would have to wait for a new request. So 'memsave' tries to open the FIFO, this blocks and ^C results in EINTR for the open(), which makes the 'memsave' command fail. Then monitor_qmp_dispatcher_co() tries to execute the rest of the queued commands, i.e. the next 'memsave' that will hang in the same way. Fixing this is easy enough: diff --git a/monitor/qmp.c b/monitor/qmp.c index 43880fa623..46939537b4 100644 --- a/monitor/qmp.c +++ b/monitor/qmp.c @@ -227,6 +227,10 @@ void coroutine_fn monitor_qmp_dispatcher_co(void *data) */ qatomic_mb_set(&qmp_dispatcher_co_busy, false); + if (qmp_dispatcher_co_shutdown) { + return; + } + while (!(req_obj = monitor_qmp_requests_pop_any_with_lock())) { /* * No more requests to process. Wait to be reentered from > The ninth C-c gives me the error reply for (in-band) command 17, and a crash: > > Terminal 1 now shows > > ^Cqemu-system-x86_64-qemu: terminating on signal 2 > ^C^C^C^C^C^C^C^Cqemu-system-x86_64-qemu: ../util/async.c:343: > aio_ctx_finalize: Assertion `flags & BH_DELETED' failed. > Aborted (core dumped) So all of this happens inside of monitor_cleanup(), while waiting for monitor_qmp_dispatcher_co() to shut down: AIO_WAIT_WHILE(qemu_get_aio_context(), (aio_poll(iohandler_get_aio_context(), false), qatomic_mb_read(&qmp_dispatcher_co_busy))); Importantly, this is _after_ calling iothread_stop(), which made sure that all pending BHs in the monitor iothread are executed. What now happens is that monitor_qmp_dispatcher_co() wants to resume the monitor. This schedules a new BH on the iothread, which was already supposed to be inactive. When finally all requests are handled and monitor_cleanup() continues after the polling loop and calls iothread_destroy(), we notice that there is a pending BH where there shouldn't be any and abort. I think this means that the commit should have moved even the iothread_stop() call to below the polling loop. I can't reproduce the problem any more with the fix above, but I think the current order in monitor_cleanup() is still a (possibly latent) bug. So why did all of that work before 357bda95? If the old code didn't crash as described in the commit message, it would just free all kinds of monitor resources while the coroutine was still running. This includes removing all pending requests from the queue. I guess this is what accidentally made it "work" previously. Kevin