On 04.04.25 06:36, Halil Pasic wrote:
On Thu, 3 Apr 2025 16:28:31 +0200
David Hildenbrand <da...@redhat.com> wrote:

Sorry I have to have a look at that discussion. Maybe it will answer
some my questions.

Yes, I think so.

Let's fix it without affecting existing setups for now by properly
ignoring the non-existing queues, so the indicator bits will match
the queue indexes.

Just one question. My understanding is that the crux is that Linux
and QEMU (or the driver and the device) disagree at which index
reporting_vq is actually sitting. Is that right?

I thought I made it clear: this is only about the airq indicator bit.
That's where both disagree.

Not the actual queue index (see above).

I did some more research including having a look at that discussion. Let
me try to sum up how did we end up here.

Let me add some more details after digging as well:


Before commit a229989d975e ("virtio: don't allocate vqs when names[i] =
NULL") the kernel behavior used to be in spec, but QEMU and possibly
other hypervisor were out of spec and things did not work.

It all started with VIRTIO_BALLOON_F_FREE_PAGE_HINT. Before that,
we only had the single optional VIRTIO_BALLOON_F_STATS_VQ queue at the very
end. So there was no possibility for holes "in-between".

In the Linux driver, we created the stats queue only if the feature bit
VIRTIO_BALLOON_F_STATS_VQ was actually around:

        nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
        err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);

That changed with VIRTIO_BALLOON_F_FREE_PAGE_HINT, because we would
unconditionally create 4 queues. QEMU always supported the first 3 queues
unconditionally, but old QEMU did obviously not support the (new)
VIRTIO_BALLOON_F_FREE_PAGE_HINT queue.

390x didn't particularly like getting queried for non-existing
queues. [1] So the fix was not for a hypervisor that was out of spec, but
because quering non-existing queues didn't work.

The fix implied that if VIRTIO_BALLOON_F_STATS_VQ is missing, suddenly the queue
index of VIRTIO_BALLOON_F_FREE_PAGE_HINT changed as well.

Again, as QEMU always implemented the 3 first queues unconditionally, this was
not a problem.

[1] 
https://lore.kernel.org/all/c6746307-fae5-7652-af8d-19f560fc3...@de.ibm.com/#t


Possibly because of the complexity of fixing the hypervisor(s) commit
a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") opted
for changing the guest side so that it does not fit the spec but fits
the hypervisor(s). It unfortunately also broke notifiers (for the with
holes) scenario for virtio-ccw only.

Yes, it broke the notifiers.

But note that everything was in spec at that point, because we only documented
"free_page_vq == 3" in the spec *2 years later*, in 2020:

commit 38448268eba0c105200d131c3f7f660129a4d673
Author: Alexander Duyck <alexander.h.du...@linux.intel.com>
Date:   Tue Aug 25 07:45:02 2020 -0700

    content: Document balloon feature free page hints
Free page hints allow the balloon driver to provide information on what
    pages are not currently in use so that we can avoid the cost of copying
    them in migration scenarios. Add a feature description for free page hints
    describing basic functioning and requirements.
At that point, what we documented in the spec *did not match reality* in
Linux. QEMU was fully compatible, because VIRTIO_BALLOON_F_STATS_VQ is
unconditionally set.


QEMU and Linux kept using that queue index assignment model, and the spec
was wrong (out of sync?) at that point. The spec got more wrong with

commit d917d4a8d552c003e046b0e3b1b529d98f7e695b
Author: Alexander Duyck <alexander.h.du...@linux.intel.com>
Date:   Tue Aug 25 07:45:17 2020 -0700

    content: Document balloon feature free page reporting
Free page reporting is a feature that allows the guest to proactively
    report unused pages to the host. By making use of this feature is is
    possible to reduce the overall memory footprint of the guest in cases where
    some significant portion of the memory is idle. Add documentation for the
    free page reporting feature describing the functionality and requirements.

Where we documented VIRTIO_BALLOON_F_REPORTING after the changes were added to
QEMU+Linux implementation, so the spec did not reflect reality.

I'll note also cloud-hypervisor [2] today follows that model.

In particular, it *only* supports VIRTIO_BALLOON_F_REPORTING, turning
the queue index of VIRTIO_BALLOON_F_REPORTING into *2* instead of documented
in the spec to be *4*.

So in reality, we can see VIRTIO_BALLOON_F_REPORTING to be either 2/3/4, 
depending
on the availability of the other two features/queues.

[2] 
https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/virtio-devices/src/balloon.rs



Now we had another look at this, and have concluded that fixing the
hypervisor(s) and fixing the kernel, and making sure that the fixed
kernel can tolerate the old broken hypervisor(s) is way to complicated
if possible at all. So we decided to give the spec a reality check and
fix the notifier bit assignment for virtio-ccw which is broken beyond
doubt if we accept that the correct virtqueue index is the one that the
hypervisor(s) use and not the one that the spec says they should use.

In case of virtio-balloon, it's unfortunate that it went that way, but the
spec simply did not / does not reflect reality when it was added to the spec.


With the spec fixed, the whole notion of "holes" will be something that
does not make sense any more. With that the merit of the kernel interface
virtio_find_vqs() supporting "holes" is quite questionable. Now we need
it because the drivers within the Linux kernel still think of the queues
in terms of the current spec, i.e. they try to have the "holes" as
mandated by the spec, and the duty of making it work with the broken
device implementations falls to the transports.


Right, the "holes" only exist in the input array.

Under the assumption that the spec is indeed going to be fixed:

Reviewed-by: Halil Pasic <pa...@linux.ibm.com>

Thanks!

--
Cheers,

David / dhildenb


Reply via email to