On 10/9/2024 2:29 AM, Eugenio Perez Martin wrote:
On Tue, Oct 8, 2024 at 10:30 PM Si-Wei Liu <si-wei....@oracle.com> wrote:


On 10/8/2024 8:40 AM, Jonah Palmer wrote:

On 10/8/24 2:51 AM, Eugenio Perez Martin wrote:
On Tue, Oct 8, 2024 at 2:14 AM Si-Wei Liu <si-wei....@oracle.com> wrote:


On 10/7/2024 6:51 AM, Eugenio Perez Martin wrote:
On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer
<jonah.pal...@oracle.com> wrote:

On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer
<jonah.pal...@oracle.com> wrote:
Implements the IOVA->GPA tree for handling mapping, unmapping, and
translations for guest memory regions.

When the guest has overlapping memory regions, an HVA to IOVA
translation may return an incorrect IOVA when searching the IOVA->HVA
tree. This is due to one HVA range being contained (overlapping) in
another HVA range in the IOVA->HVA tree. By creating an IOVA->GPA
tree, we can use GPAs to translate and find the correct IOVA for
guest memory regions.
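
For context, a rough sketch of the dual-tree layout this implies; the
iova_gpa_map field name below is my own illustration, not necessarily
what the patch uses:

    /* Sketch: keep an IOVA->GPA tree next to the existing IOVA->HVA
     * tree, so lookups for guest memory can key on GPA instead of HVA.
     * The iova_gpa_map field name is an assumption for illustration. */
    struct VhostIOVATree {
        hwaddr iova_first;
        hwaddr iova_last;
        IOVATree *iova_taddr_map;   /* IOVA -> HVA (existing) */
        IOVATree *iova_gpa_map;     /* IOVA -> GPA (new) */
    };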

Yes, this first patch is super close to what I meant, just one issue
and a pair of nits here and there.

I'd leave the second patch as an optimization on top, if the numbers
prove that adding the code is worth it.

Ah okay, gotcha. I also wasn't sure whether you wanted what you
mentioned below on the previous series implemented as well, or if
those would also be optimizations on top.

[Adding code to the vhost_iova_tree layer for handling multiple
buffers returned from translation for the memory area where each
iovec covers]:
-----------------------------------------------------------------------

"Let's say that SVQ wants to translate the HVA range
0xfeda0000-0xfedd0000. So it makes available for the device two
chained buffers: One with addr=0x1000 len=0x20000 and the other one
with addr=(0x20000c1000 len=0x10000).

The VirtIO device should be able to translate these two buffers in
isolation and chain them. Not optimal but it helps to keep QEMU
source
clean, as the device already must support it."

This is 100% in the device and QEMU is already able to split the
buffers that way, so we don't need any change in QEMU.
Note that when working with the full HVA tree directly, the internal
iova tree linear iterator iova_tree_find_address_iterator() today
doesn't guarantee that the returned IOVA range covers the entire
length of the iov. For instance, an aliased range with a smaller size
(than the requested one) may happen to be hit first in the linear
search and be returned, and that kind of fragmentation can't be
guarded against by the VirtIO device or the DMA API mentioned above.

The second patch in this series kind of mitigates this side effect by
sorting out the backing ram_block with the help of
qemu_ram_block_from_host() in the case of a guest-memory-backed iov,
so it doesn't really rely on vhost_iova_gpa_tree_find_iova() to find
the matching IOVA, but instead (somewhat implicitly) ensures the
fragmentation side effect mentioned above can never happen. I'm not
saying I like the way it is implemented, but I just wanted to point
out the implication if the second patch has to be removed: either add
special handling code to the iova-tree iterator to size the range
(the same way range selection based on permission will be done), or
add special code in the SVQ layer to deal with fragmented IOVA ranges
due to aliasing.
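
For reference, a minimal sketch of that kind of ram_block-based
lookup, assuming the qemu_ram_block_from_host() API; the GPA
derivation below (treating the block's ram_addr_t offset plus the
intra-block offset as the guest address) is my simplifying assumption
for illustration, which only holds for simple directly-mapped RAM,
and is not necessarily what the patch does:

    /* Sketch only (fragment): resolve a guest-backed HVA to its
     * RAMBlock, then query the IOVA->GPA tree instead of linearly
     * searching the HVA tree. */
    static const DMAMap *find_iova_by_gpa_sketch(VhostIOVATree *tree,
                                                 const struct iovec *iov)
    {
        ram_addr_t off;
        RAMBlock *block = qemu_ram_block_from_host(iov->iov_base,
                                                   false, &off);
        DMAMap needle = { 0 };

        if (!block) {
            return NULL;   /* not guest memory: fall back to HVA tree */
        }
        /* Assumption: block offset + intra-block offset stands in
         * for the GPA here. */
        needle.translated_addr = block->offset + off;
        needle.size = iov->iov_len;
        return vhost_iova_gpa_tree_find_iova(tree, &needle);
    }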

This special code in SVQ is already there. And it will be needed even
if we look for the buffers by GPA instead of by vaddr, the same way
virtqueue_map_desc needs to handle it even if it works with GPA.
Contiguous GPA does not imply contiguous vaddr.

My apologies if I misunderstood something here regarding size &
permission matching, but I attempted to implement both the size and
permission checks in iova_tree_find_address_iterator(); however,
there's a sizing mismatch in the vhost_svq_translate_addr() code path
that's causing vhost-net to fail to start.

qemu-system-x86_64: unable to start vhost net: 22: falling back on
userspace virtio

More specifically, in vhost_svq_add_split(), the first call to
vhost_svq_vring_write_descs() returns false:

     ok = vhost_svq_vring_write_descs(svq, sgs, out_sg, out_num,
                                      in_num > 0, false);
     if (unlikely(!ok)) {
         return false;
     }

Maybe I misunderstood the proposal, but in
iova_tree_find_address_iterator I'm checking for an exact match for
sizes:

     if (map->size != needle->size || map->perm != needle->perm) {
         return false;
     }
The permission needs to be an exact match,
Careful with this: read-only buffers can live in the guest's RW
memory. I think the only actual condition is that if the buffer is
writable, the memory area must be writable. If they don't match, we
should keep looking for the right area.
Indeed.



while the size doesn't have to. The
current iova_tree_find_address_iterator() has the following check:

      if (map->translated_addr + map->size < needle->translated_addr ||
          needle->translated_addr + needle->size < map->translated_addr) {
          return false;
      }

So essentially it only makes sure the needle range overlaps the map
range, and the first overlapping map in an ordered linear search will
be returned, but it doesn't guarantee that the ending address on the
needle (needle->translated_addr + needle->size) will be covered by
the ending address on the map (map->translated_addr + map->size).
This implementation is therefore subject to fragmented IOVA ranges
behind a contiguous HVA address. I don't see the current SVQ handling
this well, and the reason for the IOVA fragmentation is an overlapped
region or memory aliasing (unlike the GPA tree implementation, we
have no additional info to help us identify the exact IOVA when two
or more overlapped HVA ranges are given), which is orthogonal to the
HVA fragmentation (with contiguous GPA) handling in
virtqueue_map_desc().

How to handle this situation? Well, I guess Eugenio may have a
different opinion, but to me the only way seems to be to continue
searching until a well-covered IOVA range can be found, or we may
have to return multiple IOVA ranges splitting a contiguous HVA
buffer...
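
For illustration, an untested sketch (following the shape of
iova_tree_find_address_iterator() in util/iova-tree.c) of the "keep
searching until covered" option, also folding in the write-permission
rule mentioned above:

    static gboolean iova_tree_find_address_iterator(gpointer key,
                                                    gpointer value,
                                                    gpointer data)
    {
        const DMAMap *map = key;
        IOVATreeFindIOVAArgs *args = data;
        const DMAMap *needle = args->needle;

        /* Skip maps that cannot cover the needle's entire HVA span */
        if (needle->translated_addr < map->translated_addr ||
            needle->translated_addr + needle->size >
            map->translated_addr + map->size) {
            return false;
        }

        /* A writable buffer needs a writable area; RO buffers may
         * live in RW memory, so only the write bit is checked */
        if ((needle->perm & IOMMU_WO) && !(map->perm & IOMMU_WO)) {
            return false;
        }

        args->result = map;
        return true;
    }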

Not sure if I followed this 100%, but we must be capable of returning
multiple IOVA ranges if we trust in SVQ to do the translation anyway.

When SVQ starts, the guest memory listener already gives the GPA ->
HVA translations fragmented and unordered, as QEMU can hotplug memory
for example. Let me put a simple example:

GPA [0, 0x1000) -> HVA [0x10000, 0x11000)
GPA [0x1000, 0x2000) -> HVA [0x20000, 0x21000)

Now we send the IOVA to the device this way:
IOVA [0x2000, 0x3000) -> HVA [0x20000, 0x21000)
IOVA [0x3000, 0x4000) -> HVA [0x10000, 0x11000)

So we have this translation tree (assuming we already store them as GPA->IOVA):
GPA [0, 0x1000) -> IOVA [0x3000, 0x4000)
GPA [0x1000, 0x2000) -> IOVA [0x2000, 0x3000)

Now the guest sends this single buffer in a single descriptor:
GPA 0x500, len 0x1000.

We need to split it into two descriptors anyway. Without memory
aliases we're covered at this moment, as virtqueue_pop already solves
these splits for us in virtqueue_map_desc.
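
To make the split concrete, here is a tiny standalone sketch
(illustrative types, not QEMU code) that walks the GPA->IOVA tree
above and emits one descriptor per crossed entry:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t gpa, size, iova;   /* one GPA -> IOVA tree entry */
    } GpaIovaEntry;

    /* The translation tree from the example above */
    static const GpaIovaEntry tree[] = {
        { 0x0,    0x1000, 0x3000 },  /* GPA [0, 0x1000)      -> IOVA [0x3000, 0x4000) */
        { 0x1000, 0x1000, 0x2000 },  /* GPA [0x1000, 0x2000) -> IOVA [0x2000, 0x3000) */
    };

    int main(void)
    {
        uint64_t gpa = 0x500, len = 0x1000;  /* the guest's single buffer */

        for (size_t i = 0; i < 2 && len > 0; i++) {
            const GpaIovaEntry *e = &tree[i];
            uint64_t off, chunk;

            if (gpa < e->gpa || gpa >= e->gpa + e->size) {
                continue;       /* buffer does not start in this entry */
            }
            off = gpa - e->gpa;
            chunk = e->size - off < len ? e->size - off : len;
            /* prints IOVA 0x3500 len 0xb00, then IOVA 0x2000 len 0x500 */
            printf("IOVA 0x%" PRIx64 ", len 0x%" PRIx64 "\n",
                   e->iova + off, chunk);
            gpa += chunk;
            len -= chunk;
        }
        return 0;
    }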
That is what I said above: this case (contiguous GPA backed by
fragmented HVA ranges) is well covered by the DMA API called by
virtqueue_map_desc(). I was aware this case is handled in the current
SVQ code, i.e. reflected by the comments below in
vhost_handle_guest_kick(). This is definitely clear to me.

                    /*
                     * This condition is possible since a contiguous buffer in
                     * GPA does not imply a contiguous buffer in qemu's VA
                     * scatter-gather segments. If that happens, the buffer
                     * exposed to the device needs to be a chain of descriptors
                     * at this moment.
                     *
                     * SVQ cannot hold more available buffers if we are here:
                     * queue the current guest descriptor and ignore kicks
                     * until some elements are used.
                     */


Now I realize that SVQ may not be able to cover these splits with
aliases, as we might need to return more addresses than MAX(out_num,
in_num) in vhost_svq_vring_write_descs.
Right, this is the case (fragmented IOVA ranges with a contiguous HVA
range due to aliasing) I was referring to, and it is not handled at
all in the current SVQ or vhost-iova-tree code. This means we have to
handle a 1:N mapping for each iovec, which essentially introduces
unnecessary complexity rather than the simple and straightforward
change it was claimed to be. Sorry for speaking bluntly, but I don't
really see a major benefit in performance, extensibility or
flexibility from this seemingly interim solution.

On the contrary, instead of simplicity this is becoming more of a
headache now - Jonah and I have to take extra caution not to break
any other device that uses aliased maps differently from how the mch
device does, as the latter is the only case Jonah used to test this
patch series. That's why in the morning meeting I asked whether more
test cases could be shared, like how Lei Yang did in his previous
testing of your early page pinning series (I recall one failing case
seemed to be related to hot plug), as early pinning would open up a
Pandora's box that could potentially break a lot of things while the
memory layout keeps changing during early initialization.

Even with one additional test case, it is not sufficient to prove
everything is done in the right way to unblock all the memory
aliasing use cases, as our direct customers may use the vdpa device
in various VM setups together with whatever other devices - GPU, VGA,
VNC, or devices we don't know of yet that use aliased maps. It would
be crazy to exercise every possible device, and I don't see the real
benefit of building yet another sub-system outside the realm of
QEMU's memory system just to support all these existing devices,
given that the relevant alias-supporting code is built in and works
quite well within QEMU's own memory system. Don't get me wrong, it's
not me who asked to add this code, but patch #2 is definitely not an
optimization patch - we have to make sure the code is done in the
right way no matter what. Functional correctness always speaks louder
than performance or anything else.

So maybe we need to allocate the translated addresses array in
vhost_svq_vring_write_descs and return it, or allow it to grow.
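
A rough sketch (fragment) of the "allow it to grow" option, using
GLib; the append_iova_chunks() helper name is hypothetical:

    /* Sketch only: collect a variable number of translated addresses
     * per iovec instead of assuming a 1:1 iovec -> IOVA mapping.
     * append_iova_chunks() is a hypothetical helper that pushes one
     * hwaddr per IOVA chunk the iovec maps to. */
    GArray *addrs = g_array_new(FALSE, FALSE, sizeof(hwaddr));
    bool ok = true;

    for (size_t i = 0; ok && i < out_num + in_num; i++) {
        /* under aliasing, one iovec may map to several IOVA chunks */
        ok = append_iova_chunks(addrs, svq->iova_tree, &iovec[i]);
    }
    if (!ok) {
        g_array_unref(addrs);
        return false;
    }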

Having said that, I'd keep the plan as building something simple that
works first, re-prioritize if we want to focus on reducing the
downtime or on increasing the SVQ throughput, and then act in
consequence by profiling the advantages of the changes. I'm really
looking forward to the switch to the (GPA|HVA)->IOVA tree, as I'm
sure it will increase the performance of the solution, but reviewing
all of it in one shot is beyond my bandwidth at the moment.
No problem, and thank you for your time reviewing. I think after I'm
mostly done with my other assigned work I can work with Jonah to
implement the desired code changes for the (GPA|HVA)->IOVA tree. It's
not about optimization or performance - I just feel it's much simpler
than taking on the extra burden of implementing and maintaining
unwarranted, duplicated or preventative code in the SVQ or
vhost-iova-tree layer just to get around various issues in the memory
aliasing or overlapping use cases. Actually, I already regret not
giving Jonah the right guidance in the first place to implement all
the necessary virtqueue-layer supporting code instead of the ram
block API hack - believe me, it's not a lot, the code change is
really simple and easy to follow or review, and there's zero extra
cost in GPA-HVA synchronization compared with what the current code
already does. Past lessons have taught me it costs 10x more when
things aren't built right in the first place - there will be far more
time and effort spent later to refactor or rewrite all the code once
all kinds of features have been built on top.


Thanks,
-Siwei





Regards,
-Siwei



During the device setup phase, vhost_svq_add_split() ->
vhost_svq_vring_write_descs() -> vhost_svq_translate_addr() ->
vhost_iova_tree_find_iova() -> iova_tree_find_iova() is called, but
in iova_tree_find_address_iterator() map->size and needle->size
mismatch. I inserted some printf's to give more information:

...
[    8.019672] ahci 0000:00:1f.2: 6/6 ports implemented (port mask 0x3f)
[    8.020694] ahci 0000:00:1f.2: flags: 64bit ncq only
[    8.022403] scsi host0: ahci
[    8.023511] scsi host1: ahci
[    8.024446] scsi host2: ahci
[    8.025308
vhost_svq_translate_addr: iovec[i].iov_len = 0x8
iova_tree_find_address_iterator: mismatched sizes
map->size: 0xfff, needle->size: 0x8
map->perm: 1, needle->perm: 1
vhost_svq_add_split: _write_descs fail, sgs: 0x7fd85018ea80, out_sg:
0x7ff8649fbb50, out_num: 1, in_sg: 0x7ff8649fbb60, in_num: 1,
more_descs: true, write: false
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
2024-10-08T15:12:22.184902Z qemu-system-x86_64: unable to start vhost
net: 22: falling back on userspace virtio
] scsi host3: ahci
[   10.921733] scsi host4: ahci
[   10.922946] scsi host5: ahci
[   10.923486] virtio_net virtio1 enp0s2: renamed from eth0
...

This is with similar QEMU args to Si-Wei's from way back when:

-m size=128G,slots=8,maxmem=256G

There are also some error catches with just the permission check, but
it appears the vhost-net device is still able to start up (also when
sizes don't match).
I think this is because of the forced check for equal sizes, but let
me know if it is not caused by that!

Thanks!


