Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru> writes:

> On 09.04.25 13:48, Markus Armbruster wrote:
>> Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru> writes:
>> 
>>> For now we only log the vhost device error when the virtqueue is
>>> actually stopped. Let's add a QAPI event, which makes it possible to:
>>>
>>>   - collect statistics on such errors
>>>   - take immediate action: capture core dumps or do other debugging
>>>   - inform the user through a management API or UI, so that (s)he can
>>>     react somehow, e.g. reset the device driver in the guest or even
>>>     build up some automation to do so
>>>
>>> Note that basically every inconsistency discovered during virtqueue
>>> processing results in a silent virtqueue stop.  The guest then just
>>> sees the requests getting stuck somewhere in the device for no visible
>>> reason.  This event provides a means to inform the management layer of
>>> this situation in a timely fashion.
>>>
>>> The event could be reused for some other virtqueue problems (not only
>>> for vhost devices) in future. For this it gets a generic name and
>>> structure.
>>>
>>> We keep the original VHOST_OPS_DEBUG() so the original debug output
>>> stays as is; it's not the only call to VHOST_OPS_DEBUG() in the file.
>> 
>> Likely should be tracepoints.  Not this patch's problem, though.
>> 
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru>
>>> ---
>>>
>>> v6: rename path to qom-path, and improve throttling of the event
>>>      improve wording
>>>
>
> [..]
>
>>> @@ -527,6 +534,13 @@ static gboolean qapi_event_throttle_equal(const void *a, const void *b)
>>>                          qdict_get_str(evb->data, "qom-path"));
>>>       }
>>>   
>>> +    if (eva->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
>>> +        return !strcmp(qdict_get_str(eva->data, "qom-path"),
>>> +                       qdict_get_str(evb->data, "qom-path")) &&
>>> +            (qdict_get_int(eva->data, "virtqueue") ==
>>> +             qdict_get_int(evb->data, "virtqueue"));
>>> +    }
>>> +
>>>       return TRUE;
>>>   }
>>>   
>> 
>> Rate-limiting is now per virt queue.  It was per device in previous
>> revisions.  Worth it?
>> 
>
> Hmm. Probably not. If we have 2 virtqueues, it seems good to see both
> events (or only one, if only one virtqueue failed).
> If we have 256 virtqueues, 256 immediate events seem like too much.
> So it's better to drop virtqueue here and consider only qom-path for
> throttling.
>
>>> diff --git a/qapi/qdev.json b/qapi/qdev.json
>>> index 25cbcf977b..ddfae18761 100644
>>> --- a/qapi/qdev.json
>>> +++ b/qapi/qdev.json
>>> @@ -187,3 +187,35 @@
>>>   { 'command': 'device-sync-config',
>>>     'features': [ 'unstable' ],
>>>     'data': {'id': 'str'} }
>>> +
>>> +##
>>> +# @VirtqueueError:
>>> +#
>>> +# @vhost-vring-error: the vhost device has communicated failure via
>>> +#     the vring error file descriptor
>>> +#
>>> +# Since: 10.1
>>> +##
>>> +{ 'enum': 'VirtqueueError',
>>> +  'data': [ 'vhost-vring-error' ] }
>>> +
>>> +##
>>> +# @VIRTQUEUE_ERROR:
>>> +#
>>> +# Emitted when a device virtqueue fails at runtime.
>>> +#
>>> +# @device: the device's ID if it has one
>>> +#
>>> +# @qom-path: the device's QOM path
>>> +#
>>> +# @virtqueue: the index of the virtqueue that failed
>>> +#
>>> +# @error: error identifier
>>> +#
>>> +# @description: human readable description
>>> +#
>>> +# Since: 10.1
>>> +##
>>> +{ 'event': 'VIRTQUEUE_ERROR',
>>> + 'data': { '*device': 'str', 'qom-path': 'str', 'virtqueue': 'int',
>>> +            'error': 'VirtqueueError', 'description': 'str'} }
>> 
>> Standard question for events: can a management application poll for the
>> information as well?
>
> Oh, that's a good shot.
>
> I'm afraid it can't. And this made me dig into the history of this patch
> - no, we didn't discuss it before.
>
> And before trying to implement something new here (a way to get a kind of
> virtqueue status via a new QMP command), I checked that:
> - our mgmt tool still doesn't use the VIRTQUEUE_ERROR event (which we
> merged into our downstream QEMU long ago, of course)
> - the original problem that led us to introduce such an event hasn't
> bothered us for a long time
>
> It seems wiser to stop here for now. I should have considered these
> aspects before starting to revive this series. Sorry for taking your time.

Well, *I* could've remembered the standard question at the beginning!
So, sorry for your time, too :)

> Still, if we (or someone else) need such an event in the future - good, we
> have an up-to-date patch on the mailing list to start from.

Yes.  Thank you!

>> I might have asked this before, I don't remember.  If you already
>> answered it, feel free to point me to your answer.
>> 
>> Why is this a standard question for events?  Say, a management
>> application wants to track the state of X.  Two ways: poll the state
>> with a query command that returns it, listen for events that report a
>> change of X.
>> 
>> Listening for an event is more efficient.
>> 
>> However, if the management application connects to a QEMU instance, X
>> could be anything, so it needs to poll once.
>> 
>> Special case: the management application restarts for some reason.