Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru> writes:

> On 09.04.25 13:48, Markus Armbruster wrote:
>> Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru> writes:
>>
>>> For now we only log the vhost device error, when virtqueue is actually
>>> stopped. Let's add a QAPI event, which makes possible:
>>>
>>> - collect statistics of such errors
>>> - make immediate actions: take core dumps or do some other debugging
>>> - inform the user through a management API or UI, so that (s)he can
>>>   react somehow, e.g. reset the device driver in the guest or even
>>>   build up some automation to do so
>>>
>>> Note that basically every inconsistency discovered during virtqueue
>>> processing results in a silent virtqueue stop. The guest then just
>>> sees the requests getting stuck somewhere in the device for no visible
>>> reason. This event provides a means to inform the management layer of
>>> this situation in a timely fashion.
>>>
>>> The event could be reused for some other virtqueue problems (not only
>>> for vhost devices) in future. For this it gets a generic name and
>>> structure.
>>>
>>> We keep original VHOST_OPS_DEBUG(), to keep original debug output as is
>>> here, it's not the only call to VHOST_OPS_DEBUG in the file.
>>
>> Likely should be tracepoints.  Not this patch's problem, though.
>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@yandex-team.ru>
>>> ---
>>>
>>> v6: rename path to qom-path, and improve throttling of the event
>>>     improve wording
>>>
>
> [..]
>>> @@ -527,6 +534,13 @@ static gboolean qapi_event_throttle_equal(const void *a, const void *b)
>>>                             qdict_get_str(evb->data, "qom-path"));
>>>      }
>>>
>>> +    if (eva->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
>>> +        return !strcmp(qdict_get_str(eva->data, "qom-path"),
>>> +                       qdict_get_str(evb->data, "qom-path")) &&
>>> +               (qdict_get_int(eva->data, "virtqueue") ==
>>> +                qdict_get_int(evb->data, "virtqueue"));
>>> +    }
>>> +
>>>      return TRUE;
>>>  }
>>>
>>
>> Rate-limiting is now per virt queue.  It was per device in previous
>> revisions.  Worth it?
>>
>
> Hmm. Probably not. If we have 2 virtqueues, it seems good to see both
> events (or only one, if only one virtqueue failed). If we have 256
> virtqueues, 256 immediate events seems too much. So, better to drop
> virtqueue here and consider only qom-path for throttling.
>
>>> diff --git a/qapi/qdev.json b/qapi/qdev.json
>>> index 25cbcf977b..ddfae18761 100644
>>> --- a/qapi/qdev.json
>>> +++ b/qapi/qdev.json
>>> @@ -187,3 +187,35 @@
>>>  { 'command': 'device-sync-config',
>>>    'features': [ 'unstable' ],
>>>    'data': {'id': 'str'} }
>>> +
>>> +##
>>> +# @VirtqueueError:
>>> +#
>>> +# @vhost-vring-error: the vhost device has communicated failure via
>>> +#     the vring error file descriptor
>>> +#
>>> +# Since: 10.1
>>> +##
>>> +{ 'enum': 'VirtqueueError',
>>> +  'data': [ 'vhost-vring-error' ] }
>>> +
>>> +##
>>> +# @VIRTQUEUE_ERROR:
>>> +#
>>> +# Emitted when a device virtqueue fails at runtime.
>>> +#
>>> +# @device: the device's ID if it has one
>>> +#
>>> +# @qom-path: the device's QOM path
>>> +#
>>> +# @virtqueue: the index of the virtqueue that failed
>>> +#
>>> +# @error: error identifier
>>> +#
>>> +# @description: human readable description
>>> +#
>>> +# Since: 10.1
>>> +##
>>> +{ 'event': 'VIRTQUEUE_ERROR',
>>> +  'data': { '*device': 'str', 'qom-path': 'str', 'virtqueue': 'int',
>>> +            'error': 'VirtqueueError', 'description': 'str'} }
>>
>> Standard question for events: can a management application poll for the
>> information as well?
>
> Oh, that's a good shot.
>
> I'm afraid it can't. And this made me dig into the history of this
> patch - no, we didn't discuss it before.
>
> And before trying to implement something new here (a way to get a kind
> of virtqueue status via a new QMP command), I checked that:
>
> - our mgmt tool still doesn't use the VIRTQUEUE_ERROR event (which
>   we've merged into downstream QEMU long ago, of course)
> - the original problem that led us to introduce such an event hasn't
>   bothered us for a long time
>
> It seems wiser to stop here for now. I should have considered these
> aspects before beginning the process of reviving this series. Sorry
> for your time.
Well, *I* could've remembered the standard question at the beginning!
So, sorry for your time, too :)

> Still, if we (or someone else) need such an event in future - good, we
> have a modern patch on the mailing list to start from.

Yes.  Thank you!

>> I might have asked this before, I don't remember.  If you already
>> answered it, feel free to point me to your answer.
>>
>> Why is this a standard question for events?  Say, a management
>> application wants to track the state of X.  Two ways: poll the state
>> with a query command that returns it, or listen for events that report
>> a change of X.
>>
>> Listening for an event is more efficient.
>>
>> However, when the management application connects to a QEMU instance, X
>> could be anything, so it needs to poll once.
>>
>> Special case: the management application restarts for some reason.