Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru> writes:

> Currently we only log the vhost device error when the virtqueue is
> actually stopped.  Let's add a QAPI event, which makes it possible to:
>
>  - collect statistics on such errors
>  - take immediate action: capture core dumps or do other debugging
>  - inform the user through a management API or UI, so that they can
>    react somehow, e.g. reset the device driver in the guest, or even
>    build up some automation to do so
>
> Note that basically every inconsistency discovered during virtqueue
> processing results in a silent virtqueue stop.  The guest then just
> sees the requests getting stuck somewhere in the device for no visible
> reason.  This event provides a means to inform the management layer of
> this situation in a timely fashion.
>
> The event could be reused for other virtqueue problems (not only for
> vhost devices) in the future.  For that reason it gets a generic name
> and structure.
>
> We keep the original VHOST_OPS_DEBUG() so that the existing debug
> output stays as is; it's not the only call to VHOST_OPS_DEBUG in the
> file.

Likely should be tracepoints.  Not this patch's problem, though.
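
For reference, a minimal sketch of what a tracepoint version could
look like, using QEMU's tracing infrastructure.  The trace event name
below is hypothetical, not something this patch adds:

    # hw/virtio/trace-events
    vhost_virtqueue_error(int index) "vhost vring error in virtqueue %d"

    /* hw/virtio/vhost.c */
    #include "trace.h"

    /* in vhost_virtqueue_error_notifier(), replacing VHOST_OPS_DEBUG() */
    trace_vhost_virtqueue_error(ind);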

> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru>
> ---
>
> v6: rename path to qom-path, and improve throttling of the event
>     improve wording
>
>  hw/virtio/vhost.c | 12 +++++++++---
>  monitor/monitor.c | 14 ++++++++++++++
>  qapi/qdev.json    | 32 ++++++++++++++++++++++++++++++++
>  3 files changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 6aa72fd434..0b205cef73 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -15,6 +15,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "qapi/error.h"
> +#include "qapi/qapi-events-qdev.h"
>  #include "hw/virtio/vhost.h"
>  #include "qemu/atomic.h"
>  #include "qemu/range.h"
> @@ -1442,11 +1443,16 @@ static void vhost_virtqueue_error_notifier(EventNotifier *n)
>      struct vhost_virtqueue *vq = container_of(n, struct vhost_virtqueue,
>                                                error_notifier);
>      struct vhost_dev *dev = vq->dev;
> -    int index = vq - dev->vqs;
>  
>      if (event_notifier_test_and_clear(n) && dev->vdev) {
> -        VHOST_OPS_DEBUG(-EINVAL,  "vhost vring error in virtqueue %d",
> -                        dev->vq_index + index);
> +        int ind = vq - dev->vqs + dev->vq_index;
> +        DeviceState *ds = &dev->vdev->parent_obj;
> +
> +        VHOST_OPS_DEBUG(-EINVAL,  "vhost vring error in virtqueue %d", ind);
> +        qapi_event_send_virtqueue_error(ds->id, ds->canonical_path, ind,
> +                                        VIRTQUEUE_ERROR_VHOST_VRING_ERROR,
> +                                        "vhost reported failure through vring "
> +                                        "error fd");
>      }
>  }
>  
> diff --git a/monitor/monitor.c b/monitor/monitor.c
> index c5a5d30877..11c8859703 100644
> --- a/monitor/monitor.c
> +++ b/monitor/monitor.c
> @@ -313,6 +313,7 @@ static MonitorQAPIEventConf monitor_qapi_event_conf[QAPI_EVENT__MAX] = {
>      [QAPI_EVENT_BALLOON_CHANGE]    = { 1000 * SCALE_MS },
>      [QAPI_EVENT_QUORUM_REPORT_BAD] = { 1000 * SCALE_MS },
>      [QAPI_EVENT_QUORUM_FAILURE]    = { 1000 * SCALE_MS },
> +    [QAPI_EVENT_VIRTQUEUE_ERROR]   = { 1000 * SCALE_MS },
>      [QAPI_EVENT_VSERPORT_CHANGE]   = { 1000 * SCALE_MS },
>      [QAPI_EVENT_MEMORY_DEVICE_SIZE_CHANGE] = { 1000 * SCALE_MS },
>      [QAPI_EVENT_HV_BALLOON_STATUS_REPORT] = { 1000 * SCALE_MS },
> @@ -499,6 +500,12 @@ static unsigned int qapi_event_throttle_hash(const void *key)
>          hash += g_str_hash(qdict_get_str(evstate->data, "qom-path"));
>      }
>  
> +    if (evstate->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
> +        uint64_t virtqueue = qdict_get_int(evstate->data, "virtqueue");
> +        hash += g_str_hash(qdict_get_str(evstate->data, "qom-path")) ^
> +            g_int64_hash(&virtqueue);
> +    }
> +
>      return hash;
>  }
>  
> @@ -527,6 +534,13 @@ static gboolean qapi_event_throttle_equal(const void *a, const void *b)
>                         qdict_get_str(evb->data, "qom-path"));
>      }
>  
> +    if (eva->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
> +        return !strcmp(qdict_get_str(eva->data, "qom-path"),
> +                       qdict_get_str(evb->data, "qom-path")) &&
> +            (qdict_get_int(eva->data, "virtqueue") ==
> +             qdict_get_int(evb->data, "virtqueue"));
> +    }
> +
>      return TRUE;
>  }
>  

Rate-limiting is now per virtqueue.  It was per device in previous
revisions.  Worth it?
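
With the hash and equality functions above, errors on different
virtqueues of the same device land in separate throttle buckets, so
within one 1000 ms window a client can receive one event per
virtqueue.  Roughly like this on the wire (payload values are made up;
assumes a virtio-net device with id "vnet0"):

    {"event": "VIRTQUEUE_ERROR",
     "data": {"qom-path": "/machine/peripheral/vnet0/virtio-backend",
              "virtqueue": 0, "error": "vhost-vring-error",
              "description": "vhost reported failure through vring error fd"},
     "timestamp": {"seconds": 1718000000, "microseconds": 123456}}
    {"event": "VIRTQUEUE_ERROR",
     "data": {"qom-path": "/machine/peripheral/vnet0/virtio-backend",
              "virtqueue": 1, "error": "vhost-vring-error",
              "description": "vhost reported failure through vring error fd"},
     "timestamp": {"seconds": 1718000000, "microseconds": 123789}}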

> diff --git a/qapi/qdev.json b/qapi/qdev.json
> index 25cbcf977b..ddfae18761 100644
> --- a/qapi/qdev.json
> +++ b/qapi/qdev.json
> @@ -187,3 +187,35 @@
>  { 'command': 'device-sync-config',
>    'features': [ 'unstable' ],
>    'data': {'id': 'str'} }
> +
> +##
> +# @VirtqueueError:
> +#
> +# @vhost-vring-error: the vhost device has communicated failure via
> +#     the vring error file descriptor
> +#
> +# Since: 10.1
> +##
> +{ 'enum': 'VirtqueueError',
> +  'data': [ 'vhost-vring-error' ] }
> +
> +##
> +# @VIRTQUEUE_ERROR:
> +#
> +# Emitted when a device virtqueue fails at runtime.
> +#
> +# @device: the device's ID if it has one
> +#
> +# @qom-path: the device's QOM path
> +#
> +# @virtqueue: the index of the virtqueue that failed
> +#
> +# @error: error identifier
> +#
> +# @description: human readable description
> +#
> +# Since: 10.1
> +##
> +{ 'event': 'VIRTQUEUE_ERROR',
> +  'data': { '*device': 'str', 'qom-path': 'str', 'virtqueue': 'int',
> +            'error': 'VirtqueueError', 'description': 'str'} }

Standard question for events: can a management application poll for the
information as well?

I might have asked this before, I don't remember.  If you already
answered it, feel free to point me to your answer.

Why is this a standard question for events?  Say, a management
application wants to track the state of X.  There are two ways: poll
the state with a query command that returns it, or listen for events
that report a change of X.

Listening for an event is more efficient.

However, when the management application first connects to a QEMU
instance, X could be in any state, so it needs to poll at least once.

Special case: the management application restarts for some reason.  It
then has to poll again, since it may have missed events while it was
down.
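
To make that concrete: something along the lines of the sketch below
would let a restarted application resynchronize.  The command is
entirely hypothetical; nothing like it exists today, which is exactly
what I'm asking about:

    -> { "execute": "query-virtqueue-errors" }
    <- { "return": [
           { "qom-path": "/machine/peripheral/vnet0/virtio-backend",
             "virtqueue": 0, "error": "vhost-vring-error" } ] }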

