Wolfgang Bumiller <w.bumil...@proxmox.com> writes:

> On Wed, Oct 09, 2019 at 10:39:44AM +0200, Markus Armbruster wrote:
>> Cc: Marc-André for additional monitor and chardev expertise.
>>
>> Wolfgang Bumiller <w.bumil...@proxmox.com> writes:
>>
>> > When a monitor's queue is filled up in handle_qmp_command() it gets
>> > suspended.  It is currently the dispatcher bh's job to resume the
>> > monitor, which it does after processing an event from the queue.
>> > However, it is possible for a CHR_EVENT_CLOSED event to be processed
>> > before the bh is scheduled, which will clear the queue without
>> > resuming the monitor, thereby preventing the dispatcher from
>> > reaching the resume() call.
>>
>> Because with the request queue cleared, there's nothing for
>> monitor_qmp_requests_pop_any_with_lock() to pop, so
>> monitor_qmp_bh_dispatcher() won't look at this monitor.  It stays
>> suspended forever.  Correct?
>>
>> Observable effect for the monitor's user?
>
> Yes.
I was too terse, let me try again: what exactly breaks for the
monitor's user?

> More easily triggered now with oob.  We ran into this quite a while
> ago, but our only reliable trigger was a customized version of
> -loadstate which loads the state from a separate file instead of the
> vmstate region of a qcow2.  It turns out that doing this on slow
> storage (~12s to load the data) caused our status daemon to try to
> poll the qmp socket during the load-state and give up after a 3s
> timeout.  And since the BH runs in the main loop, which is not even
> entered until after the loadstate has finished, while the iothread
> handling the qmp socket does fill & clear the queue, the qmp socket
> always ended up unusable afterwards.
>
> Aside from that, we have users reporting the same symptom (a hanging
> qmp socket) appearing randomly on busy systems.
>
>> > Fix this by resuming the monitor when clearing a queue which
>> > was filled up.
>> >
>> > Signed-off-by: Wolfgang Bumiller <w.bumil...@proxmox.com>
>> > ---
>> > @Michael, we ran into this with qemu 4.0, so if the logic in this patch
>> > is correct it may make sense to include it in the 4.0.1 roundup.
>> > A backport is at [1] as 4.0 was before the monitor/ dir split.
>> >
>> > [1] https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714
>> >
>> >  monitor/qmp.c | 10 ++++++++++
>> >  1 file changed, 10 insertions(+)
>> >
>> > diff --git a/monitor/qmp.c b/monitor/qmp.c
>> > index 9d9e5d8b27..c1db5bf940 100644
>> > --- a/monitor/qmp.c
>> > +++ b/monitor/qmp.c
>> > @@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
>> >  /* Caller must hold mon->qmp.qmp_queue_lock */
>> >  static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
>> >  {
>> > +    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
>> > +        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;
>>
>> Can you explain why this condition is correct?
>
> Sorry, I meant to add a comment pointing to monitor_qmp_bh_dispatcher(),
> which does the following *after* popping 1 element off the queue:
>
>     need_resume = !qmp_oob_enabled(mon) ||
>         mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX - 1;
>     qemu_mutex_unlock(&mon->qmp_queue_lock);
>
> It's supposed to be the same condition, but evaluated _before_ popping
> off an element (hence no `- 1`).  The queue also shouldn't be empty,
> otherwise the `monitor_suspend()` in `handle_qmp_command()` hasn't
> happened; though on second thought, we could probably just return
> early in that case.

I see.  Could we monitor_resume() unconditionally here?

>> >      while (!g_queue_is_empty(mon->qmp_requests)) {
>> >          qmp_request_free(g_queue_pop_head(mon->qmp_requests));
>> >      }
>> > +    if (need_resume) {
>> > +        /*
>> > +         * Pairs with the monitor_suspend() in handle_qmp_command() in case the
>> > +         * queue gets cleared from a CHR_EVENT_CLOSED event before the dispatch
>> > +         * bh got scheduled.
>> > +         */
>> > +        monitor_resume(&mon->common);
>> > +    }
>> >  }
>> >
>> >  static void monitor_qmp_cleanup_queues(MonitorQMP *mon)
>>
>> Is monitor_qmp_cleanup_req_queue_locked() the correct place?
>>
>> It's called from
>>
>> * monitor_qmp_event() case CHR_EVENT_CLOSED via
>>   monitor_qmp_cleanup_queues(), as part of destroying the monitor's
>>   session state.
>>
>>   This is the case you're trying to fix.  Correct?
>>
>>   I figure monitor_resume() is safe because we haven't really
>>   destroyed anything yet, we merely flushed the request queue.
>>   Correct?
>>
>> * monitor_data_destroy() via monitor_data_destroy_qmp() when
>>   destroying the monitor.
>>
>>   Can need_resume be true in this case?  If yes, is monitor_resume()
>>   still safe?  We're in the middle of destroying the monitor...
>
> I thought so when first reading through it, but on second thought, we
> should probably avoid this for sanity's sake.
> Maybe with a flag, or an extra parameter.
> Or we could introduce a "bool queue_filled" we set in handle_qmp_command()
> instead of "calculating" `need_resume` in 2 places, and unset it in
> `monitor_data_destroy()` before clearing the queue?

Could we simply call monitor_resume() in monitor_qmp_event() right after
monitor_qmp_cleanup_queues()?
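To make the bookkeeping we are arguing about concrete, here is a tiny
stand-alone model of the suspend counter (not QEMU code; all names
below are made up for illustration).  It only shows the invariant any
variant of the fix has to preserve: every suspend taken when the queue
filled up must be paired with exactly one resume, whether the request
is dispatched or thrown away on CHR_EVENT_CLOSED:

    /* toy_model.c - stand-alone sketch, not QEMU code */
    #include <stdio.h>
    #include <stdbool.h>

    enum { QUEUE_MAX = 8 };

    static int suspend_cnt;    /* stands in for mon->suspend_cnt          */
    static int queue_len;      /* stands in for mon->qmp_requests->length */
    static bool oob_enabled;   /* stands in for qmp_oob_enabled(mon)      */

    /* stands in for handle_qmp_command(): suspend when we can't queue more */
    static void enqueue(void)
    {
        queue_len++;
        if (!oob_enabled || queue_len == QUEUE_MAX) {
            suspend_cnt++;                     /* monitor_suspend() */
        }
    }

    /* stands in for clearing the queue on CHR_EVENT_CLOSED */
    static void cleanup_queue(bool patched)
    {
        bool need_resume = (!oob_enabled && queue_len > 0)
            || queue_len == QUEUE_MAX;         /* condition from the patch */

        queue_len = 0;
        if (patched && need_resume) {
            suspend_cnt--;                     /* monitor_resume() */
        }
    }

    int main(void)
    {
        oob_enabled = false;                   /* the non-OOB case          */

        enqueue();                             /* suspends the monitor      */
        cleanup_queue(false);                  /* CLOSED before the bh ran  */
        printf("without fix: suspend_cnt = %d (monitor stuck if > 0)\n",
               suspend_cnt);

        suspend_cnt = 0;
        enqueue();
        cleanup_queue(true);                   /* patched behaviour         */
        printf("with fix:    suspend_cnt = %d\n", suspend_cnt);
        return 0;
    }

Whether that pairing resume ends up in
monitor_qmp_cleanup_req_queue_locked(), guarded so the
monitor_data_destroy() path skips it, or directly in
monitor_qmp_event() as I'm suggesting above, seems secondary to me, as
long as the counter comes back to zero exactly once.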