Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru> writes:

> On 04.04.25 09:46, Markus Armbruster wrote:
>> Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru> writes:
>> 
>>> For now we only log the vhost device error, when virtqueue is actually
>>> stopped. Let's add a QAPI event, which makes possible:
>>>
>>>   - collect statistics of such errors
>>>   - make immediate actions: take core dumps or do some other debugging
>>>   - inform the user through a management API or UI, so that (s)he can
>>>    react somehow, e.g. reset the device driver in the guest or even
>>>    build up some automation to do so
>>>
>>> Note that basically every inconsistency discovered during virtqueue
>>> processing results in a silent virtqueue stop.  The guest then just
>>> sees the requests getting stuck somewhere in the device for no visible
>>> reason.  This event provides a means to inform the management layer of
>>> this situation in a timely fashion.
>>>
>>> The event could be reused for some other virtqueue problems (not only
>>> for vhost devices) in future. For this it gets a generic name and
>>> structure.
>>>
>>> We keep original VHOST_OPS_DEBUG(), to keep original debug output as is
>>> here, it's not the only call to VHOST_OPS_DEBUG in the file.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru>
>>> Reviewed-by: Denis Plotnikov <den-plotni...@yandex-team.ru>
>>> ---
>>>
>>> v5: resend, update version in QAPI to 10.1
>>>     drop a-b by Markus (too much time passed, the context could
>>>     changed. Markus, is the patch still OK?)
>> 
>> Happy to take another look :)
>> 
>>>  hw/virtio/vhost.c | 12 +++++++++---
>>>  monitor/monitor.c | 10 ++++++++++
>>>  qapi/qdev.json    | 32 ++++++++++++++++++++++++++++++++
>>>  3 files changed, 51 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>> index 6aa72fd434..0b205cef73 100644
>>> --- a/hw/virtio/vhost.c
>>> +++ b/hw/virtio/vhost.c
>>> @@ -15,6 +15,7 @@
>>>   
>>>  #include "qemu/osdep.h"
>>>  #include "qapi/error.h"
>>> +#include "qapi/qapi-events-qdev.h"
>>>  #include "hw/virtio/vhost.h"
>>>  #include "qemu/atomic.h"
>>>  #include "qemu/range.h"
>>> @@ -1442,11 +1443,16 @@ static void 
>>> vhost_virtqueue_error_notifier(EventNotifier *n)
>>>      struct vhost_virtqueue *vq = container_of(n, struct vhost_virtqueue,
>>>                                                error_notifier);
>>>      struct vhost_dev *dev = vq->dev;
>>> -    int index = vq - dev->vqs;
>>>  
>>>      if (event_notifier_test_and_clear(n) && dev->vdev) {
>>> -        VHOST_OPS_DEBUG(-EINVAL,  "vhost vring error in virtqueue %d",
>>> -                        dev->vq_index + index);
>>> +        int ind = vq - dev->vqs + dev->vq_index;
>>> +        DeviceState *ds = &dev->vdev->parent_obj;
>>> +
>>> +        VHOST_OPS_DEBUG(-EINVAL,  "vhost vring error in virtqueue %d", 
>>> ind);
>>> +        qapi_event_send_virtqueue_error(ds->id, ds->canonical_path, ind,
>>> +                                        VIRTQUEUE_ERROR_VHOST_VRING_ERROR,
>>> +                                        "vhost reported failure through 
>>> vring "
>>> +                                        "error fd");
>>>      }
>>>  }
>>>   
>>> diff --git a/monitor/monitor.c b/monitor/monitor.c
>>> index c5a5d30877..1296a9207e 100644
>>> --- a/monitor/monitor.c
>>> +++ b/monitor/monitor.c
>>> @@ -313,6 +313,7 @@ static MonitorQAPIEventConf 
>>> monitor_qapi_event_conf[QAPI_EVENT__MAX] = {
>>>      [QAPI_EVENT_BALLOON_CHANGE]    = { 1000 * SCALE_MS },
>>>      [QAPI_EVENT_QUORUM_REPORT_BAD] = { 1000 * SCALE_MS },
>>>      [QAPI_EVENT_QUORUM_FAILURE]    = { 1000 * SCALE_MS },
>>> +    [QAPI_EVENT_VIRTQUEUE_ERROR]   = { 1000 * SCALE_MS },
>>>      [QAPI_EVENT_VSERPORT_CHANGE]   = { 1000 * SCALE_MS },
>>>      [QAPI_EVENT_MEMORY_DEVICE_SIZE_CHANGE] = { 1000 * SCALE_MS },
>>>      [QAPI_EVENT_HV_BALLOON_STATUS_REPORT] = { 1000 * SCALE_MS },
>>> @@ -499,6 +500,10 @@ static unsigned int qapi_event_throttle_hash(const 
>>> void *key)
>>>          hash += g_str_hash(qdict_get_str(evstate->data, "qom-path"));
>>>      }
>>>   
>>> +    if (evstate->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
>>> +        hash += g_str_hash(qdict_get_str(evstate->data, "device"));
>> 
>> Isn't @device optional?  Should you use @qom-path instead?
>
> Yes, I'm thinking about it too.

This function and the next one together enable fine-grained event
throttling.  I figure the intent of your patch is to rate-limit this
event per device.  For that, you must use qom-path.  If you wanted to
rate-limit per virtqueue, you'd have to throw in @virtqueue.

As is, you have a crash bugL g_str_hash() expects a non-null argument.

>                                 Maybe, drop the "device" field from the event 
> at all, and keep only qom-path? Years passed since movement to use qom paths..

Possible justifications for having @device: consistency with other
events, convenience for users.  I can't judge the latter.

>
>> 
>>> +    }
>>> +
>>>      return hash;
>>>  }
>>>   
>>> @@ -527,6 +532,11 @@ static gboolean qapi_event_throttle_equal(const void 
>>> *a, const void *b)
>>>                         qdict_get_str(evb->data, "qom-path"));
>>>      }
>>>   
>>> +    if (eva->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
>>> +        return !strcmp(qdict_get_str(eva->data, "device"),
>>> +                       qdict_get_str(evb->data, "device"));
>> 
>> Likewise.
>> 
>>> +    }
>>> +
>>>      return TRUE;
>>>  }
>>>   
>>> diff --git a/qapi/qdev.json b/qapi/qdev.json
>>> index 25cbcf977b..2d20f4777e 100644
>>> --- a/qapi/qdev.json
>>> +++ b/qapi/qdev.json
>>> @@ -187,3 +187,35 @@
>>>  { 'command': 'device-sync-config',
>>>    'features': [ 'unstable' ],
>>>    'data': {'id': 'str'} }
>>> +
>>> +##
>>> +# @VirtqueueError:
>>> +#
>>> +# @vhost-vring-error: Vhost device reported failure through
>>> +#     through vring error fd.
>> 
>> One "through" too many.
>> 
>> I'm not quite sure I understand what you're trying to express.  Is it
>> "the vhost device has communicated failure via the vring error file
>> descriptor"?
>
> Yes. Will use your wording.
>
>> 
>> I know next to nothing about vhost devices...
>> 
>>> +#
>>> +# Since: 10.1
>>> +##
>>> +{ 'enum': 'VirtqueueError',
>>> +  'data': [ 'vhost-vring-error' ] }
>>> +
>>> +##
>>> +# @VIRTQUEUE_ERROR:
>>> +#
>>> +# Emitted when a device virtqueue fails in runtime.
>> 
>> I think it's "at runtime".
>> 
>> Can a device have more than one virtqueue?
>
> Yes. And event has "virtqueue": "int" for it.

Got it, thanks!

>>> +#
>>> +# @device: the device's ID if it has one
>>> +#
>>> +# @path: the device's QOM path
>> 
>> I agree with your follow-up: name it @qom-path.
>> 
>>> +#
>>> +# @virtqueue: virtqueue index
>> 
>> Bear with me...  What's a virtqueue index?
>
> sequence number of virtqueue of the vhost device

Maybe something like "the index of the virtqueue that failed".

>>> +#
>>> +# @error: error identifier
>>> +#
>>> +# @description: human readable description
>>> +#
>>> +# Since: 10.1
>>> +##
>>> +{ 'event': 'VIRTQUEUE_ERROR',
>>> + 'data': { '*device': 'str', 'path': 'str', 'virtqueue': 'int',
>>> +            'error': 'VirtqueueError', 'description': 'str'} }
>> 


Reply via email to