On Thu, Mar 18, 2021 at 02:35:50PM +0100, Stefan Reiter wrote: > If OOB is disabled, events received in monitor_qmp_event will be handled > in the main context. Thus, we must not acquire a qmp_queue_lock there, > as the dispatcher coroutine holds one over a yield point, where it > expects to be rescheduled from the main context. If a CHR_EVENT_CLOSED > event is received just then, it can race and block the main thread by > waiting on the queue lock. > > Run monitor_qmp_cleanup_queue_and_resume in a BH on the iohandler > thread, so the main thread can always make progress during the > reschedule. > > The delaying of the cleanup is safe, since the dispatcher always moves > back to the iothread afterward, and thus the cleanup will happen before > it gets to its next iteration. > > Signed-off-by: Stefan Reiter <s.rei...@proxmox.com> > ---
This is a tough one. It *may* be fine, but I wonder if we can approach this differently: >From what I can gather we have the following call stacks & contexts: Guarded lock (lock & release): * monitor_qmp_cleanup_queue_and_resume by monitor_qmp_event by file handler (from I/O loop) ^ iohandler_context (assuming that's where the file handling happens...) (after this patch as BH though) * handle_qmp_command a) by the json parser (which is also re-initialized by monitor_qmp_event btw., haven't checked if that can also "trigger" its methods immediately) b) by monitor_qmp_read by file handler (from I/O loop) ^ iohandler_context Lock-"returning": * monitor_qmp_requests_pop_any_with_lock by coroutine_fn monitor_qmp_dispatcher_co ^ iohandler_context Lock-releasing: * coroutine_fn monitor_qmp_dispatcher_co ^ qemu_aio_context The only *weird* thing that immediately pops out here is `monitor_qmp_requests_pop_any_with_lock()` keeping a lock while switching contexts. This is done in order to allow `AIO_WAIT_WHILE` to work while making progress on the events, but do we actually already need to be in this context for the OOB `monitor_resume()` call or can we defer the context switch to after having done that and released the lock? `monitor_resume()` itself seems to simply schedule a BH which should work regardless if I'm not mistaken. There's also a `readline_show_prompt()` call, but that *looks* harmless? `monitor_resume()` is also called without the lock later on, so even if it needs to be in this context at that point for whatever reason, does it need the lock?