Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru> writes: > For now we only log the vhost device error, when virtqueue is actually > stopped. Let's add a QAPI event, which makes possible: > > - collect statistics of such errors > - make immediate actions: take core dumps or do some other debugging > - inform the user through a management API or UI, so that (s)he can > react somehow, e.g. reset the device driver in the guest or even > build up some automation to do so > > Note that basically every inconsistency discovered during virtqueue > processing results in a silent virtqueue stop. The guest then just > sees the requests getting stuck somewhere in the device for no visible > reason. This event provides a means to inform the management layer of > this situation in a timely fashion. > > The event could be reused for some other virtqueue problems (not only > for vhost devices) in future. For this it gets a generic name and > structure. > > We keep original VHOST_OPS_DEBUG(), to keep original debug output as is > here, it's not the only call to VHOST_OPS_DEBUG in the file.
Likely should be tracepoints. Not this patch's problem, though. > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru> > --- > > v6: rename path to qom-path, and improve throttling of the event > improve wording > > hw/virtio/vhost.c | 12 +++++++++--- > monitor/monitor.c | 14 ++++++++++++++ > qapi/qdev.json | 32 ++++++++++++++++++++++++++++++++ > 3 files changed, 55 insertions(+), 3 deletions(-) > > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index 6aa72fd434..0b205cef73 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -15,6 +15,7 @@ > > #include "qemu/osdep.h" > #include "qapi/error.h" > +#include "qapi/qapi-events-qdev.h" > #include "hw/virtio/vhost.h" > #include "qemu/atomic.h" > #include "qemu/range.h" > @@ -1442,11 +1443,16 @@ static void > vhost_virtqueue_error_notifier(EventNotifier *n) > struct vhost_virtqueue *vq = container_of(n, struct vhost_virtqueue, > error_notifier); > struct vhost_dev *dev = vq->dev; > - int index = vq - dev->vqs; > > if (event_notifier_test_and_clear(n) && dev->vdev) { > - VHOST_OPS_DEBUG(-EINVAL, "vhost vring error in virtqueue %d", > - dev->vq_index + index); > + int ind = vq - dev->vqs + dev->vq_index; > + DeviceState *ds = &dev->vdev->parent_obj; > + > + VHOST_OPS_DEBUG(-EINVAL, "vhost vring error in virtqueue %d", ind); > + qapi_event_send_virtqueue_error(ds->id, ds->canonical_path, ind, > + VIRTQUEUE_ERROR_VHOST_VRING_ERROR, > + "vhost reported failure through > vring " > + "error fd"); > } > } > > diff --git a/monitor/monitor.c b/monitor/monitor.c > index c5a5d30877..11c8859703 100644 > --- a/monitor/monitor.c > +++ b/monitor/monitor.c > @@ -313,6 +313,7 @@ static MonitorQAPIEventConf > monitor_qapi_event_conf[QAPI_EVENT__MAX] = { > [QAPI_EVENT_BALLOON_CHANGE] = { 1000 * SCALE_MS }, > [QAPI_EVENT_QUORUM_REPORT_BAD] = { 1000 * SCALE_MS }, > [QAPI_EVENT_QUORUM_FAILURE] = { 1000 * SCALE_MS }, > + [QAPI_EVENT_VIRTQUEUE_ERROR] = { 1000 * SCALE_MS }, > [QAPI_EVENT_VSERPORT_CHANGE] = { 1000 * SCALE_MS }, > [QAPI_EVENT_MEMORY_DEVICE_SIZE_CHANGE] = { 1000 * SCALE_MS }, > [QAPI_EVENT_HV_BALLOON_STATUS_REPORT] = { 1000 * SCALE_MS }, > @@ -499,6 +500,12 @@ static unsigned int qapi_event_throttle_hash(const void > *key) > hash += g_str_hash(qdict_get_str(evstate->data, "qom-path")); > } > > + if (evstate->event == QAPI_EVENT_VIRTQUEUE_ERROR) { > + uint64_t virtqueue = qdict_get_int(evstate->data, "virtqueue"); > + hash += g_str_hash(qdict_get_str(evstate->data, "qom-path")) ^ > + g_int64_hash(&virtqueue); > + } > + > return hash; > } > > @@ -527,6 +534,13 @@ static gboolean qapi_event_throttle_equal(const void *a, > const void *b) > qdict_get_str(evb->data, "qom-path")); > } > > + if (eva->event == QAPI_EVENT_VIRTQUEUE_ERROR) { > + return !strcmp(qdict_get_str(eva->data, "qom-path"), > + qdict_get_str(evb->data, "qom-path")) && > + (qdict_get_int(eva->data, "virtqueue") == > + qdict_get_int(evb->data, "virtqueue")); > + } > + > return TRUE; > } > Rate-limiting is now per virt queue. It was per device in previous revisions. Worth it? > diff --git a/qapi/qdev.json b/qapi/qdev.json > index 25cbcf977b..ddfae18761 100644 > --- a/qapi/qdev.json > +++ b/qapi/qdev.json > @@ -187,3 +187,35 @@ > { 'command': 'device-sync-config', > 'features': [ 'unstable' ], > 'data': {'id': 'str'} } > + > +## > +# @VirtqueueError: > +# > +# @vhost-vring-error: the vhost device has communicated failure via > +# the vring error file descriptor > +# > +# Since: 10.1 > +## > +{ 'enum': 'VirtqueueError', > + 'data': [ 'vhost-vring-error' ] } > + > +## > +# @VIRTQUEUE_ERROR: > +# > +# Emitted when a device virtqueue fails at runtime. > +# > +# @device: the device's ID if it has one > +# > +# @qom-path: the device's QOM path > +# > +# @virtqueue: the index of the virtqueue that failed > +# > +# @error: error identifier > +# > +# @description: human readable description > +# > +# Since: 10.1 > +## > +{ 'event': 'VIRTQUEUE_ERROR', > + 'data': { '*device': 'str', 'qom-path': 'str', 'virtqueue': 'int', > + 'error': 'VirtqueueError', 'description': 'str'} } Standard question for events: can a management application poll for the information as well? I might have asked this before, I don't remember. If you already answered it, feel free to point me to your answer. Why is this a standard question for events? Say, a management application wants to track the state of X. Two ways: poll the state with a query command that returns it, listen for events that report a change of X. Listening for an event is more efficient. However, if the management application connects to a QEMU instance, X could be anything, so it needs to poll once. Special case: the management application restarts for some reason.