On Wed, Oct 09, 2019 at 09:18:04PM +0200, Markus Armbruster wrote:
> Wolfgang Bumiller <w.bumil...@proxmox.com> writes:
>
> > On Wed, Oct 09, 2019 at 10:39:44AM +0200, Markus Armbruster wrote:
> >> Cc: Marc-André for additional monitor and chardev expertise.
> >>
> >> Wolfgang Bumiller <w.bumil...@proxmox.com> writes:
> >>
> >> > When a monitor's queue is filled up in handle_qmp_command()
> >> > it gets suspended. It's currently the dispatcher bh's job to
> >> > resume the monitor, which it does after processing an event
> >> > from the queue. However, it is possible for a
> >> > CHR_EVENT_CLOSED event to be processed before the bh
> >> > is scheduled, which will clear the queue without resuming
> >> > the monitor, thereby preventing the dispatcher from reaching
> >> > the resume() call.
> >>
> >> Because with the request queue cleared, there's nothing for
> >> monitor_qmp_requests_pop_any_with_lock() to pop, so
> >> monitor_qmp_bh_dispatcher() won't look at this monitor.  It stays
> >> suspended forever.  Correct?
> >>
> >> Observable effect for the monitor's user?
> >
> > Yes.
>
> I was too terse, let me try again: what exactly breaks for the
> monitor's user?

Any new connections to qmp will be accept()ed and show the greeting
({"QMP":{"version"...}}), but the monitor will not respond to any
messages sent afterwards (though in my tests, after a few more attempts
to reconnect/talk to the same qmp socket, sometimes not even the
greeting would appear anymore).
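
For illustration, this is roughly the kind of probe our status daemon
ends up doing (a simplified standalone sketch with a made-up socket
path, not the actual daemon code).  Against an affected monitor the
greeting still arrives, but the poll() below runs into its 3s timeout:

#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical socket path, adjust to taste. */
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    char buf[4096];
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    strncpy(addr.sun_path, "/run/qemu/qmp.sock", sizeof(addr.sun_path) - 1);
    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* The greeting ({"QMP":{...}}) still shows up... */
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = 0;
        printf("greeting: %s\n", buf);
    }

    /* ...but nothing ever comes back for this: */
    const char *cmd = "{\"execute\": \"qmp_capabilities\"}\n";
    if (write(fd, cmd, strlen(cmd)) < 0) {
        perror("write");
    }

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    if (poll(&pfd, 1, 3000) == 0) {
        fprintf(stderr, "no reply within 3s - monitor stuck suspended?\n");
    }

    close(fd);
    return 0;
}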
> > More easily triggered now with oob. We ran into this a longer time
> > ago, but our only reliable trigger was a customized version of
> > -loadstate which loads the state from a separate file instead of the
> > vmstate region of a qcow2. Turns out that doing this on slow storage
> > (~12s to load the data) caused our status daemon to try to poll the
> > qmp socket during the load-state and give up after a 3s timeout. And
> > since the BH runs in the main loop, which is not even entered until
> > after the loadstate has finished, while the iothread handling the qmp
> > socket does fill & clear the queue, the qmp socket always ended up
> > unusable afterwards.
> >
> > Aside from that we have users reporting the same symptom (hanging
> > qmp) appearing randomly on busy systems.
> >
> >> > Fix this by resuming the monitor when clearing a queue which
> >> > was filled up.
> >> >
> >> > Signed-off-by: Wolfgang Bumiller <w.bumil...@proxmox.com>
> >> > ---
> >> > @Michael, we ran into this with qemu 4.0, so if the logic in this patch
> >> > is correct it may make sense to include it in the 4.0.1 roundup.
> >> > A backport is at [1] as 4.0 was before the monitor/ dir split.
> >> >
> >> > [1]
> >> > https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714
> >> >
> >> >  monitor/qmp.c | 10 ++++++++++
> >> >  1 file changed, 10 insertions(+)
> >> >
> >> > diff --git a/monitor/qmp.c b/monitor/qmp.c
> >> > index 9d9e5d8b27..c1db5bf940 100644
> >> > --- a/monitor/qmp.c
> >> > +++ b/monitor/qmp.c
> >> > @@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
> >> >  /* Caller must hold mon->qmp.qmp_queue_lock */
> >> >  static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
> >> >  {
> >> > +    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
> >> > +        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;
> >>
> >> Can you explain why this condition is correct?
> >
> > Sorry, I meant to add a comment pointing to monitor_qmp_bh_dispatcher(),
> > which does the following *after* popping 1 element off the queue:
> >
> >     need_resume = !qmp_oob_enabled(mon) ||
> >         mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX - 1;
> >     qemu_mutex_unlock(&mon->qmp_queue_lock);
> >
> > It's supposed to be the same condition, but evaluated _before_ popping
> > off an element (hence no `- 1`).  The queue also shouldn't be empty,
> > because otherwise the `monitor_suspend()` in `handle_qmp_command()`
> > hasn't happened (though on second thought, we could probably just
> > return early in that case).
>
> I see.
>
> Could we monitor_resume() unconditionally here?

We can't, because suspend()/resume() employ a counter, which would
become unbalanced and could easily go negative by queuing a command
without filling the queue up and then quickly disconnecting.
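
To illustrate the balancing problem with a deliberately simplified,
self-contained model (made-up names, not the real monitor code):

#include <assert.h>

typedef struct Mon {
    int suspend_cnt;    /* > 0 means reading from the monitor is suspended */
} Mon;

static void mon_suspend(Mon *m)
{
    m->suspend_cnt++;
}

static void mon_resume(Mon *m)
{
    assert(m->suspend_cnt > 0);     /* an unmatched resume would go negative */
    m->suspend_cnt--;
}

int main(void)
{
    Mon m = { 0 };

    /* Queue fills up: handle_qmp_command() suspends exactly once. */
    mon_suspend(&m);
    /* CHR_EVENT_CLOSED clears the full queue: one resume is balanced. */
    mon_resume(&m);

    /*
     * But a single queued command (queue never filled, so no suspend)
     * followed by a quick disconnect must NOT resume; an unconditional
     * resume in the cleanup path would trip the assertion above.
     */
    /* mon_resume(&m); */

    return 0;
}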
>
> >> >      while (!g_queue_is_empty(mon->qmp_requests)) {
> >> >          qmp_request_free(g_queue_pop_head(mon->qmp_requests));
> >> >      }
> >> > +    if (need_resume) {
> >> > +        /*
> >> > +         * Pairs with the monitor_suspend() in handle_qmp_command()
> >> > +         * in case the queue gets cleared from a CH_EVENT_CLOSED
> >> > +         * event before the dispatch bh got scheduled.
> >> > +         */
> >> > +        monitor_resume(&mon->common);
> >> > +    }
> >> >  }
> >> >
> >> >  static void monitor_qmp_cleanup_queues(MonitorQMP *mon)
> >>
> >> Is monitor_qmp_cleanup_req_queue_locked() the correct place?
> >>
> >> It's called from
> >>
> >> * monitor_qmp_event() case CHR_EVENT_CLOSED via
> >>   monitor_qmp_cleanup_queues(), as part of destroying the monitor's
> >>   session state.
> >>
> >>   This is the case you're trying to fix.  Correct?
> >>
> >>   I figure monitor_resume() is safe because we haven't really
> >>   destroyed anything yet, we merely flushed the request queue.
> >>   Correct?
> >>
> >> * monitor_data_destroy() via monitor_data_destroy_qmp() when
> >>   destroying the monitor.
> >>
> >>   Can need_resume be true in this case?  If yes, is monitor_resume()
> >>   still safe?  We're in the middle of destroying the monitor...
> >
> > I thought so when first reading through it, but on second thought we
> > should probably avoid this for sanity's sake.
> > Maybe with a flag, or an extra parameter.
> > Or we could introduce a "bool queue_filled" we set in
> > handle_qmp_command() instead of "calculating" `need_resume` in 2
> > places, and unset it in `monitor_data_destroy()` before clearing the
> > queue?
>
> Could we simply call monitor_resume() in monitor_qmp_event() right
> after monitor_qmp_cleanup_queues()?

We'd still need to make sure the suspend counter is balanced out with
the corresponding suspend() calls.
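
If we go the "bool queue_filled" route mentioned above, I'd imagine
something roughly like this (a sketch only; the field and helper names
are invented here and the real monitor_suspend()/monitor_resume() calls
are reduced to comments, so this is not actual monitor/qmp.c code):

#include <stdbool.h>

typedef struct MonitorQMP {
    bool queue_filled;   /* set when the queue filled up and we suspended */
    /* ... request queue, locks, common Monitor part, etc. ... */
} MonitorQMP;

/* In handle_qmp_command(), where the queue hits QMP_REQ_QUEUE_LEN_MAX: */
static void note_queue_filled(MonitorQMP *mon)
{
    /* monitor_suspend(&mon->common); */
    mon->queue_filled = true;
}

/* In the CHR_EVENT_CLOSED cleanup path, after flushing the queue: */
static void resume_if_queue_was_filled(MonitorQMP *mon)
{
    if (mon->queue_filled) {
        mon->queue_filled = false;
        /* monitor_resume(&mon->common); */
    }
}

/* In monitor_data_destroy(), before clearing the queue: just drop the
 * flag so no resume is attempted on a monitor being torn down. */
static void forget_queue_filled(MonitorQMP *mon)
{
    mon->queue_filled = false;
}

int main(void)
{
    MonitorQMP mon = { 0 };

    /* Normal case: queue filled, connection closed, resume exactly once. */
    note_queue_filled(&mon);
    resume_if_queue_was_filled(&mon);

    /* Destroy case: drop the flag first, so the later flush won't resume. */
    note_queue_filled(&mon);
    forget_queue_filled(&mon);
    resume_if_queue_was_filled(&mon);   /* does nothing now */

    return 0;
}

That way the resume only ever happens when the matching suspend()
actually ran, so the counter can't be thrown off balance.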