On 09.04.25 13:48, Markus Armbruster wrote:
Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru> writes:

For now we only log the vhost device error when the virtqueue is
actually stopped. Let's add a QAPI event, which makes it possible to:

  - collect statistics of such errors
  - take immediate action: capture core dumps or do some other debugging
  - inform the user through a management API or UI, so that (s)he can
    react somehow, e.g. reset the device driver in the guest, or even
    build up some automation to do so

Note that basically every inconsistency discovered during virtqueue
processing results in a silent virtqueue stop.  The guest then just
sees the requests getting stuck somewhere in the device for no visible
reason.  This event provides a means to inform the management layer of
this situation in a timely fashion.

The event could be reused for other virtqueue problems (not only for
vhost devices) in the future. For this reason it gets a generic name
and structure.

We keep the original VHOST_OPS_DEBUG() call to preserve the existing
debug output as is; it's not the only call to VHOST_OPS_DEBUG() in the
file.

Likely should be tracepoints.  Not this patch's problem, though.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru>
---

v6: rename path to qom-path and improve throttling of the event;
     improve wording


[..]
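The hunks that actually emit the event are elided above.  For context,
a rough sketch of what the emission site could look like in
hw/virtio/vhost.c's error-notifier handler; it assumes the helper and
constant names the QAPI generator derives from the schema below and
current QEMU code, so the real hunk may well differ:

/* Sketch only, not the actual patch hunk. */
#include "qapi/qapi-events-qdev.h"   /* generated from qapi/qdev.json */

static void vhost_virtqueue_error_notifier(EventNotifier *n)
{
    struct vhost_virtqueue *vq = container_of(n, struct vhost_virtqueue,
                                              error_notifier);
    struct vhost_dev *dev = vq->dev;
    int index = vq - dev->vqs;

    if (event_notifier_test_and_clear(n) && dev->vdev) {
        /* Existing debug output is kept as is. */
        VHOST_OPS_DEBUG(-EINVAL, "vhost vring error in virtqueue %d",
                        dev->vq_index + index);
        /* New: report the failure to the management layer, too. */
        g_autofree char *path =
            object_get_canonical_path(OBJECT(dev->vdev));
        qapi_event_send_virtqueue_error(DEVICE(dev->vdev)->id, path,
                                        dev->vq_index + index,
                                        VIRTQUEUE_ERROR_VHOST_VRING_ERROR,
                                        "vhost vring error");
    }
}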

@@ -527,6 +534,13 @@ static gboolean qapi_event_throttle_equal(const void *a, const void *b)
                         qdict_get_str(evb->data, "qom-path"));
      }
+    if (eva->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
+        return !strcmp(qdict_get_str(eva->data, "qom-path"),
+                       qdict_get_str(evb->data, "qom-path")) &&
+            (qdict_get_int(eva->data, "virtqueue") ==
+             qdict_get_int(evb->data, "virtqueue"));
+    }
+
      return TRUE;
  }

Rate-limiting is now per virt queue.  It was per device in previous
revisions.  Worth it?


Hmm. Probably not. If we have 2 virtqueues, it seems good to see both
events (or only one, if only one virtqueue failed).
If we have 256 virtqueues, 256 immediate events seem too much.
So it's better to drop virtqueue here and consider only qom-path for
throttling.
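
Concretely, the hunk above would then shrink to something like this
(sketch):

    if (eva->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
        /* Throttle per device: compare only the QOM path. */
        return !strcmp(qdict_get_str(eva->data, "qom-path"),
                       qdict_get_str(evb->data, "qom-path"));
    }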

diff --git a/qapi/qdev.json b/qapi/qdev.json
index 25cbcf977b..ddfae18761 100644
--- a/qapi/qdev.json
+++ b/qapi/qdev.json
@@ -187,3 +187,35 @@
  { 'command': 'device-sync-config',
    'features': [ 'unstable' ],
    'data': {'id': 'str'} }
+
+##
+# @VirtqueueError:
+#
+# @vhost-vring-error: the vhost device has communicated failure via
+#     the vring error file descriptor
+#
+# Since: 10.1
+##
+{ 'enum': 'VirtqueueError',
+  'data': [ 'vhost-vring-error' ] }
+
+##
+# @VIRTQUEUE_ERROR:
+#
+# Emitted when a device virtqueue fails at runtime.
+#
+# @device: the device's ID if it has one
+#
+# @qom-path: the device's QOM path
+#
+# @virtqueue: the index of the virtqueue that failed
+#
+# @error: error identifier
+#
+# @description: human readable description
+#
+# Since: 10.1
+##
+{ 'event': 'VIRTQUEUE_ERROR',
+  'data': { '*device': 'str', 'qom-path': 'str', 'virtqueue': 'int',
+            'error': 'VirtqueueError', 'description': 'str'} }
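
(For illustration: on the QMP wire such an event would look roughly
like this; all values are made up.)

{"event": "VIRTQUEUE_ERROR",
 "data": {"device": "net0",
          "qom-path": "/machine/peripheral/net0",
          "virtqueue": 2,
          "error": "vhost-vring-error",
          "description": "vhost vring error in virtqueue 2"},
 "timestamp": {"seconds": 1744199280, "microseconds": 123456}}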

Standard question for events: can a management application poll for the
information as well?

Oh, that's a good point.

I'm afraid it can't. And this made me dig into the history of this
patch: no, we didn't discuss it before.

And before trying to implement something new here (a way to get some
kind of virtqueue status through a new QMP command), I checked that:
- our management tool still doesn't use the VIRTQUEUE_ERROR event
(which we merged into our downstream QEMU long ago, of course)
- the original problem that led us to introduce such an event hasn't
bothered us for a long time

It seems wiser to stop here for now. I should have considered these
aspects before setting out to revive this series. Sorry for taking
your time.

Still, if we (or someone else) need such an event in the future: good,
we have an up-to-date patch on the mailing list to start from.


I might have asked this before, I don't remember.  If you already
answered it, feel free to point me to your answer.

Why is this a standard question for events?  Say, a management
application wants to track the state of X.  Two ways: poll the state
with a query command that returns it, listen for events that report a
change of X.

Listening for an event is more efficient.

However, when the management application connects to a QEMU instance, X
could be anything, so it needs to poll once initially.

Special case: the management application restarts for some reason, and
has to resynchronize the same way.
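
In QMP terms the pattern would be something like the following, where
query-virtqueue-status is purely hypothetical (as discussed above, no
such command exists today):

-> { "execute": "query-virtqueue-status" }
<- { "return": [ { "qom-path": "/machine/peripheral/net0",
                   "virtqueue": 0, "failed": false } ] }
(from then on, just listen:)
<- { "event": "VIRTQUEUE_ERROR", "data": { ... } }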


--
Best regards,
Vladimir

