On 10/8/2024 8:40 AM, Jonah Palmer wrote:


On 10/8/24 2:51 AM, Eugenio Perez Martin wrote:
On Tue, Oct 8, 2024 at 2:14 AM Si-Wei Liu <si-wei....@oracle.com> wrote:



On 10/7/2024 6:51 AM, Eugenio Perez Martin wrote:
On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer <jonah.pal...@oracle.com> wrote:


On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer <jonah.pal...@oracle.com> wrote:
Implements the IOVA->GPA tree for handling mapping, unmapping, and
translations for guest memory regions.

When the guest has overlapping memory regions, an HVA to IOVA translation
may return an incorrect IOVA when searching the IOVA->HVA tree. This is
due to one HVA range being contained (overlapping) in another HVA range
in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use GPAs to
translate and find the correct IOVA for guest memory regions.
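
For illustration only, here is a rough sketch of the dual-tree idea on top
of QEMU's existing IOVATree/DMAMap API; the struct layout and field names
here are hypothetical and not necessarily what the patch uses:

    /* Sketch: pair the existing IOVA->HVA tree with an IOVA->GPA tree. */
    #include "qemu/iova-tree.h"

    typedef struct VhostIOVATree {
        IOVATree *iova_taddr_map;   /* IOVA -> HVA, as today */
        IOVATree *iova_gpa_map;     /* IOVA -> GPA, new in this series */
    } VhostIOVATree;

    /* Find the IOVA backing a guest-memory buffer, keyed by GPA. */
    static const DMAMap *vhost_iova_gpa_tree_find_iova(const VhostIOVATree *t,
                                                       const DMAMap *needle)
    {
        /* needle->translated_addr holds a GPA here, not an HVA. */
        return iova_tree_find_iova(t->iova_gpa_map, needle);
    }

Because GPAs are unique in the guest's address space, a GPA-keyed lookup
cannot be confused by two HVA ranges that alias each other.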

Yes, this first patch is super close to what I meant, just one issue
and a pair of nits here and there.

I'd leave the second patch as an optimization on top, if the numbers
prove that adding the code is worth it.

Ah okay, gotcha. I also wasn't sure whether you wanted what you mentioned
below (from the previous series) implemented here too, or whether those
would also be optimizations on top.

[Adding code to the vhost_iova_tree layer for handling multiple buffers
returned from translation for the memory area where each iovec covers]:
-----------------------------------------------------------------------
"Let's say that SVQ wants to translate the HVA range
0xfeda0000-0xfedd0000. So it makes available for the device two
chained buffers: One with addr=0x1000 len=0x20000 and the other one
with addr=0x20000c1000 len=0x10000.

The VirtIO device should be able to translate these two buffers in
isolation and chain them. Not optimal but it helps to keep QEMU source
clean, as the device already must support it."

This is 100% in the device and QEMU is already able to split the
buffers that way, so we don't need any change in QEMU.
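
To make the quoted example concrete, a hedged sketch of such a splitting
loop (hypothetical code: emit_desc() is an invented callback, and
DMAMap.size is assumed to be inclusive, i.e. length - 1, as in
util/iova-tree.c):

    /* Sketch: walk a contiguous HVA range, emitting one chained
     * descriptor per IOVA mapping it spans. */
    static bool translate_and_chain(VhostIOVATree *tree, const void *hva,
                                    size_t len,
                                    void (*emit_desc)(hwaddr iova, size_t len))
    {
        while (len > 0) {
            const DMAMap needle = {
                .translated_addr = (hwaddr)(uintptr_t)hva,
                .size = len - 1,
            };
            const DMAMap *map = vhost_iova_tree_find_iova(tree, &needle);
            if (unlikely(!map)) {
                return false;   /* no mapping covers this address */
            }
            /* Offset into the mapping, clamped to what it covers. */
            size_t off = needle.translated_addr - map->translated_addr;
            size_t chunk = MIN(len, map->size + 1 - off);
            emit_desc(map->iova + off, chunk);
            hva = (const uint8_t *)hva + chunk;
            len -= chunk;
        }
        return true;
    }

With the quoted example, such a loop would emit iova=0x1000 len=0x20000
followed by iova=0x20000c1000 len=0x10000, chained via VRING_DESC_F_NEXT.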
Noted that if working with the full HVA tree directly, the internal iova
tree linear iterator iova_tree_find_address_iterator() today doesn't
guarantee that the IOVA range returned covers the entire length of the
iov. For instance, an aliased range smaller than the requested size may
happen to be hit first in the linear search and be returned, and neither
the VirtIO device nor the DMA API mentioned above can guard against that
fragmentation.

The second patch in this series somewhat mitigates this side effect by
sorting out the backing ram_block with the help of
qemu_ram_block_from_host() when the iov is backed by guest memory, so it
doesn't really count on vhost_iova_gpa_tree_find_iova() to find the
matching IOVA; instead it (somewhat implicitly) ensures the fragmentation
side effect mentioned above can never happen. I'm not saying I like the
way it is implemented, but I wanted to point out the implication if the
second patch has to be removed: either add special handling code to the
iova-tree iterator to size the range (the same way range selection based
upon permission will be done), or add special code in the SVQ layer to
deal with IOVA ranges fragmented by aliasing.


This special code in SVQ is already there. And it will be needed even
if we look for the buffers by GPA instead of by vaddr, the same way
virtqueue_map_desc needs to handle it even if it works with GPA.
Contiguous GPA does not imply contiguous vaddr.
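
As a minimal illustration of that last point (a sketch only, not the
actual virtqueue_map_desc() code), mapping a GPA-contiguous buffer may
yield several host pointers, so callers already have to loop:

    /* Sketch: dma_memory_map() may map less than requested, so one
     * GPA-contiguous range can become several HVA chunks (iovecs). */
    static int map_gpa_range(AddressSpace *as, dma_addr_t gpa, dma_addr_t len,
                             struct iovec *iov, int max_iov)
    {
        int n = 0;
        while (len > 0 && n < max_iov) {
            dma_addr_t mapped = len;
            void *hva = dma_memory_map(as, gpa, &mapped,
                                       DMA_DIRECTION_FROM_DEVICE,
                                       MEMTXATTRS_UNSPECIFIED);
            if (!hva) {
                return -1;
            }
            iov[n].iov_base = hva;
            iov[n].iov_len = mapped;   /* may be shorter than len */
            gpa += mapped;
            len -= mapped;
            n++;
        }
        return n;
    }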


My apologies if I misunderstood something here regarding size &
permission matching, but I attempted to implement both the size and
permission checks in iova_tree_find_address_iterator(); however, there's
a sizing mismatch in the vhost_svq_translate_addr() code path that's
causing vhost-net to fail to start.

qemu-system-x86_64: unable to start vhost net: 22: falling back on
userspace virtio

More specifically, in vhost_svq_add_split(), the first call to
vhost_svq_vring_write_descs() returns false:

    ok = vhost_svq_vring_write_descs(svq, sgs, out_sg, out_num, in_num >
                                     0, false);
    if (unlikely(!ok)) {
        return false;
    }

Maybe I misunderstood the proposal, but in
iova_tree_find_address_iterator() I'm checking for an exact match on
sizes:

    if (map->size != needle->size || map->perm != needle->perm) {
        return false;
    }

The permission needs to match exactly, while the size doesn't have to.
The current iova_tree_find_address_iterator() has the following check:

    if (map->translated_addr + map->size < needle->translated_addr ||
        needle->translated_addr + needle->size < map->translated_addr) {
        return false;
    }

So essentially it does make sure the starting translated_addr on the
needle is greater than or equal to the starting translated_addr on the
map, and the first match of the map range in an ordered linear search
will be returned. But it doesn't guarantee that the ending address on the
needle (needle->translated_addr + needle->size) will be covered by the
ending address on the map (map->translated_addr + map->size), so this
implementation is subject to fragmented iova ranges with contiguous HVA
addresses. I don't see the current SVQ handling this well, and the cause
of the iova fragmentation is overlapped regions or memory aliasing
(unlike with the GPA tree implementation, we have no additional info to
help us identify the exact IOVA when two or more overlapped HVA ranges
are given). This is orthogonal to the HVA fragmentation (with contiguous
GPA) handled in virtqueue_map_desc().

How to handle this situation? Well, I guess Eugenio may have a different
opinion, but to me the only way seems to be to continue searching until a
well-covered IOVA range is found, or else to return multiple IOVA ranges
that split a contiguous HVA buffer...
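
As a sketch of the first option (untested), the check in
iova_tree_find_address_iterator() could demand full coverage rather than
mere overlap; returning false from the GTraverseFunc makes
g_tree_foreach() keep searching for a better map:

    /* Sketch: reject maps that do not fully cover the needle. With
     * inclusive DMAMap sizes, both sums below are last-byte addresses. */
    if (needle->translated_addr < map->translated_addr ||
        needle->translated_addr + needle->size >
        map->translated_addr + map->size) {
        return false;
    }

The permission comparison would stay an exact match, as noted above.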

Regards,
-Siwei




During the device setup phase, vhost_svq_add_split() ->
vhost_svq_vring_write_descs() -> vhost_svq_translate_addr() ->
vhost_iova_tree_find_iova() -> iova_tree_find_iova() is called, but in
iova_tree_find_address_iterator() map->size and needle->size mismatch. I
inserted some printf's to give more information:

...
[    8.019672] ahci 0000:00:1f.2: 6/6 ports implemented (port mask 0x3f)
[    8.020694] ahci 0000:00:1f.2: flags: 64bit ncq only
[    8.022403] scsi host0: ahci
[    8.023511] scsi host1: ahci
[    8.024446] scsi host2: ahci
[    8.025308
vhost_svq_translate_addr: iovec[i].iov_len = 0x8
iova_tree_find_address_iterator: mismatched sizes
map->size: 0xfff, needle->size: 0x8
map->perm: 1, needle->perm: 1
vhost_svq_add_split: _write_descs fail, sgs: 0x7fd85018ea80, out_sg: 0x7ff8649fbb50, out_num: 1, in_sg: 0x7ff8649fbb60, in_num: 1, more_descs: true, write: false
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
2024-10-08T15:12:22.184902Z qemu-system-x86_64: unable to start vhost net: 22: falling back on userspace virtio
] scsi host3: ahci
[   10.921733] scsi host4: ahci
[   10.922946] scsi host5: ahci
[   10.923486] virtio_net virtio1 enp0s2: renamed from eth0
...

This is with similar QEMU args to Si-Wei's from way back when:

-m size=128G,slots=8,maxmem=256G

There are also some error catches with just the permission check, but it
appears the vhost-net device is still able to start up (i.e., when not
also matching sizes).

