Wolfgang Bumiller <w.bumil...@proxmox.com> writes:

> On Wed, Oct 09, 2019 at 10:39:44AM +0200, Markus Armbruster wrote:
>> Cc: Marc-André for additional monitor and chardev expertise.
>>
>> Wolfgang Bumiller <w.bumil...@proxmox.com> writes:
>>
>> > When a monitor's queue is filled up in handle_qmp_command() it gets
>> > suspended.  It is currently the dispatcher bh's job to resume the
>> > monitor, which it does after processing an event from the queue.
>> > However, it is possible for a CHR_EVENT_CLOSED event to be processed
>> > before the bh is scheduled, which will clear the queue without
>> > resuming the monitor, thereby preventing the dispatcher from
>> > reaching the resume() call.
>>
>> Because with the request queue cleared, there's nothing for
>> monitor_qmp_requests_pop_any_with_lock() to pop, so
>> monitor_qmp_bh_dispatcher() won't look at this monitor.  It stays
>> suspended forever.  Correct?
>>
>> Observable effect for the monitor's user?
>
> Yes.
I was too terse, let me try again: what exactly breaks for the
monitor's user?

> More easily triggered now with oob.  We ran into this quite a while
> ago, but our only reliable trigger was a customized version of
> -loadstate which loads the state from a separate file instead of the
> vmstate region of a qcow2.  It turns out that doing this on slow
> storage (~12s to load the data) caused our status daemon to try to
> poll the qmp socket during the load-state and give up after a 3s
> timeout.  And since the BH runs in the main loop, which is not even
> entered until after the loadstate has finished, while the iothread
> handling the qmp socket does fill & clear the queue, the qmp socket
> always ended up unusable afterwards.
>
> Aside from that, we have users reporting the same symptom (a hanging
> qmp socket) appearing randomly on busy systems.
>
>> > Fix this by resuming the monitor when clearing a queue which
>> > was filled up.
>> >
>> > Signed-off-by: Wolfgang Bumiller <w.bumil...@proxmox.com>
>> > ---
>> > @Michael, we ran into this with qemu 4.0, so if the logic in this patch
>> > is correct it may make sense to include it in the 4.0.1 roundup.
>> > A backport is at [1] as 4.0 was before the monitor/ dir split.
>> >
>> > [1] https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714
>> >
>> >  monitor/qmp.c | 10 ++++++++++
>> >  1 file changed, 10 insertions(+)
>> >
>> > diff --git a/monitor/qmp.c b/monitor/qmp.c
>> > index 9d9e5d8b27..c1db5bf940 100644
>> > --- a/monitor/qmp.c
>> > +++ b/monitor/qmp.c
>> > @@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
>> >  /* Caller must hold mon->qmp.qmp_queue_lock */
>> >  static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
>> >  {
>> > +    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
>> > +        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;
>>
>> Can you explain why this condition is correct?
>
> Sorry, I meant to add a comment pointing to monitor_qmp_bh_dispatcher(),
> which does the following *after* popping 1 element off the queue:
>
>     need_resume = !qmp_oob_enabled(mon) ||
>         mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX - 1;
>     qemu_mutex_unlock(&mon->qmp_queue_lock);
>
> It's supposed to be the same condition, but evaluated _before_ popping
> off an element (hence no `- 1`).  The queue also shouldn't be empty,
> otherwise the `monitor_suspend()` in `handle_qmp_command()` hasn't
> happened; though on second thought, we could probably just return
> early in that case.

I see.  Could we monitor_resume() unconditionally here?

>> >      while (!g_queue_is_empty(mon->qmp_requests)) {
>> >          qmp_request_free(g_queue_pop_head(mon->qmp_requests));
>> >      }
>> > +    if (need_resume) {
>> > +        /*
>> > +         * Pairs with the monitor_suspend() in handle_qmp_command() in case the
>> > +         * queue gets cleared from a CHR_EVENT_CLOSED event before the dispatch
>> > +         * bh got scheduled.
>> > +         */
>> > +        monitor_resume(&mon->common);
>> > +    }
>> >  }
>> >
>> >  static void monitor_qmp_cleanup_queues(MonitorQMP *mon)
>>
>> Is monitor_qmp_cleanup_req_queue_locked() the correct place?
>>
>> It's called from
>>
>> * monitor_qmp_event() case CHR_EVENT_CLOSED via
>>   monitor_qmp_cleanup_queues(), as part of destroying the monitor's
>>   session state.
>>
>>   This is the case you're trying to fix.  Correct?
>>
>>   I figure monitor_resume() is safe because we haven't really
>>   destroyed anything yet, we merely flushed the request queue.
>>   Correct?
>>
>> * monitor_data_destroy() via monitor_data_destroy_qmp() when
>>   destroying the monitor.
>>
>>   Can need_resume be true in this case?  If yes, is monitor_resume()
>>   still safe?  We're in the middle of destroying the monitor...
>
> I thought so when first reading through it, but on second thought, we
> should probably avoid this for sanity's sake.
> Maybe with a flag, or an extra parameter.
> Or we could introduce a "bool queue_filled" we set in handle_qmp_command()
> instead of "calculating" `need_resume` in 2 places, and unset it in
> `monitor_data_destroy()` before clearing the queue?

Could we simply call monitor_resume() in monitor_qmp_event() right after
monitor_qmp_cleanup_queues()?
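To make the bookkeeping we are arguing about concrete, here is a tiny
stand-alone model of the suspend counter (not QEMU code; all names
below are made up for illustration).  It only shows the invariant any
variant of the fix has to preserve: every suspend taken when the queue
filled up must be paired with exactly one resume, whether the request
is dispatched or thrown away on CHR_EVENT_CLOSED:

    /* toy_model.c - stand-alone sketch, not QEMU code */
    #include <stdio.h>
    #include <stdbool.h>

    enum { QUEUE_MAX = 8 };

    static int suspend_cnt;    /* stands in for mon->suspend_cnt          */
    static int queue_len;      /* stands in for mon->qmp_requests->length */
    static bool oob_enabled;   /* stands in for qmp_oob_enabled(mon)      */

    /* stands in for handle_qmp_command(): suspend when we can't queue more */
    static void enqueue(void)
    {
        queue_len++;
        if (!oob_enabled || queue_len == QUEUE_MAX) {
            suspend_cnt++;                     /* monitor_suspend() */
        }
    }

    /* stands in for clearing the queue on CHR_EVENT_CLOSED */
    static void cleanup_queue(bool patched)
    {
        bool need_resume = (!oob_enabled && queue_len > 0)
            || queue_len == QUEUE_MAX;         /* condition from the patch */

        queue_len = 0;
        if (patched && need_resume) {
            suspend_cnt--;                     /* monitor_resume() */
        }
    }

    int main(void)
    {
        oob_enabled = false;                   /* the non-OOB case          */

        enqueue();                             /* suspends the monitor      */
        cleanup_queue(false);                  /* CLOSED before the bh ran  */
        printf("without fix: suspend_cnt = %d (monitor stuck if > 0)\n",
               suspend_cnt);

        suspend_cnt = 0;
        enqueue();
        cleanup_queue(true);                   /* patched behaviour         */
        printf("with fix:    suspend_cnt = %d\n", suspend_cnt);
        return 0;
    }

Whether that pairing resume ends up in
monitor_qmp_cleanup_req_queue_locked(), guarded so the
monitor_data_destroy() path skips it, or directly in
monitor_qmp_event() as I'm suggesting above, seems secondary to me, as
long as the counter comes back to zero exactly once.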