On Wed, May 8, 2024 at 2:52 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
>
>
>
> On 5/1/2024 11:44 PM, Eugenio Perez Martin wrote:
> > On Thu, May 2, 2024 at 1:16 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>
> >>
> >> On 4/30/2024 10:19 AM, Eugenio Perez Martin wrote:
> >>> On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>
> >>>> On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:
> >>>>> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>>>>>>>> virtqueue. These mappings may not match with the GPA->HVA ones.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This causes a problem when overlapped regions (different GPA but
> >>>>>>>>>>>>> same translated HVA) exist in the tree, as looking them up by HVA
> >>>>>>>>>>>>> will return them twice. To solve this, create an id member so we
> >>>>>>>>>>>>> can assign unique identifiers (GPA) to the maps.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Signed-off-by: Eugenio Pérez <epere...@redhat.com>
> >>>>>>>>>>>>> ---
> >>>>>>>>>>>>>    include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>>>>>    util/iova-tree.c         | 3 ++-
> >>>>>>>>>>>>>    2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>>>>>        hwaddr iova;
> >>>>>>>>>>>>>        hwaddr translated_addr;
> >>>>>>>>>>>>>        hwaddr size;                /* Inclusive */
> >>>>>>>>>>>>> +    uint64_t id;
> >>>>>>>>>>>>>        IOMMUAccessFlags perm;
> >>>>>>>>>>>>>    } QEMU_PACKED DMAMap;
> >>>>>>>>>>>>>    typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>>>>>>>>>>>     * @map: the mapping to search
> >>>>>>>>>>>>>     *
> >>>>>>>>>>>>>     * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>>>>>>>>>>>> - * mapping range specified. Only the first found mapping will be
> >>>>>>>>>>>>> - * returned.
> >>>>>>>>>>>>> + * mapping range specified and map->id is equal. Only the first found
> >>>>>>>>>>>>> + * mapping will be returned.
> >>>>>>>>>>>>>     *
> >>>>>>>>>>>>>     * Return: DMAMap pointer if found, or NULL if not found. Note that
> >>>>>>>>>>>>>     * the returned DMAMap pointer is maintained internally. User should
> >>>>>>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>>>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>>>>>>>> --- a/util/iova-tree.c
> >>>>>>>>>>>>> +++ b/util/iova-tree.c
> >>>>>>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>        needle = args->needle;
> >>>>>>>>>>>>>        if (map->translated_addr + map->size < needle->translated_addr ||
> >>>>>>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>>>>>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>>>>>>>>>>>> +        needle->id != map->id) {
> >>>>>>>>>>>> It looks like this iterator can also be invoked by SVQ from
> >>>>>>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where the guest
> >>>>>>>>>>>> GPA space will be searched without passing in the ID (GPA), and an
> >>>>>>>>>>>> exact match for the same GPA range is not actually needed, unlike
> >>>>>>>>>>>> the mapping removal case. Could we create an API variant for the SVQ
> >>>>>>>>>>>> lookup case specifically? Or alternatively, add a special flag, say
> >>>>>>>>>>>> skip_id_match, to DMAMap, and the id match check may look like below:
> >>>>>>>>>>>>
> >>>>>>>>>>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think vhost_svq_translate_addr() could just call the API variant
> >>>>>>>>>>>> or pass a DMAMap with skip_id_match set to true to
> >>>>>>>>>>>> svq_iova_tree_find_iova().
> >>>>>>>>>>>>
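Folded into the existing iterator, that variant could look roughly like
this (untested sketch; skip_id_match is the hypothetical flag proposed
above, not an existing DMAMap field):

static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
                                                gpointer data)
{
    const DMAMap *map = key;
    IOVATreeFindIOVAArgs *args = data;
    const DMAMap *needle = args->needle;

    if (map->translated_addr + map->size < needle->translated_addr ||
        needle->translated_addr + needle->size < map->translated_addr ||
        (!needle->skip_id_match && needle->id != map->id)) {
        /* No HVA overlap, or the caller asked for an exact id match */
        return false;
    }

    args->result = map;
    return true;    /* stop the g_tree_foreach() walk */
}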
> >>>>>>>>>>> I think you're totally right. But I'd really like to not
> >>>>>>>>>>> complicate the API of the iova_tree more.
> >>>>>>>>>>>
> >>>>>>>>>>> I think we can look for the hwaddr using memory_region_from_host
> >>>>>>>>>>> and then get the hwaddr. It is another lookup though...
> >>>>>>>>>> Yeah, that will be another means of doing translation without
> >>>>>>>>>> having to complicate the API around iova_tree. I wonder how the
> >>>>>>>>>> lookup through memory_region_from_host() may perform compared to
> >>>>>>>>>> the iova tree one; the former looks to be an O(N) linear search on
> >>>>>>>>>> a linked list while the latter would be roughly O(log N) on an AVL
> >>>>>>>>>> tree?
> >>>>>>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> >>>>>>>>> linear too. It is not even ordered.
> >>>>>>>> Oh sorry, I misread the code; I should have looked at
> >>>>>>>> g_tree_foreach() instead of g_tree_search_node(). So the former is
> >>>>>>>> indeed a linear iteration, but it looks to be ordered?
> >>>>>>>>
> >>>>>>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
> >>>>>>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
> >>>>>>>
> >>>>>>> If we have these translations:
> >>>>>>> [0x1000, 0x2000] -> [0x10000, 0x11000]
> >>>>>>> [0x2000, 0x3000] -> [0x6000, 0x7000]
> >>>>>>>
> >>>>>>> We will see them in this order, so we cannot stop the search at the
> >>>>>>> first node.
> >>>>>> Yeah, reverse lookup is unordered indeed, anyway.
> >>>>>>
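For context, this is why the reverse direction stays linear:
iova_tree_find_iova() boils down to an unconditional g_tree_foreach()
walk, roughly the below, since the tree is keyed (and ordered) by IOVA
while the needle is a vaddr:

const DMAMap *iova_tree_find_iova(const IOVATree *tree, const DMAMap *map)
{
    IOVATreeFindIOVAArgs args = {
        .needle = map,
    };

    /* Keys are ordered by IOVA, not vaddr, so no early exit is possible */
    g_tree_foreach(tree->tree, iova_tree_find_address_iterator, &args);
    return args.result;
}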
> >>>>>>>>> But apart from this detail you're right, I have the same concerns
> >>>>>>>>> with this solution too. If we see a hard performance regression we
> >>>>>>>>> could go to more complicated solutions, like maintaining a reverse
> >>>>>>>>> IOVATree in vhost-iova-tree too. First RFCs of SVQ did that actually.
> >>>>>>>> Agreed, yep we can use memory_region_from_host for now. Any reason
> >>>>>>>> why the reverse IOVATree was dropped, lack of users? But now we have
> >>>>>>>> one!
> >>>>>>>>
> >>>>>>> No, it is just simplicity. We already have a user in the hot path in
> >>>>>>> the master branch, vhost_svq_vring_write_descs. But I never profiled
> >>>>>>> enough to find out whether it is a bottleneck or not, to be honest.
> >>>>>> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
> >>>>>> profile and see the difference.
> >>>>>>> I'll send the new series by today, thank you for finding these issues!
> >>>>>> Thanks! In case you don't have bandwidth to add back the reverse IOVA
> >>>>>> tree, Jonah (cc'ed) may have interest in looking into it.
> >>>>>>
> >>>>> Actually, yes. I've tried to solve it using:
> >>>>> memory_region_get_ram_ptr -> It's hard to get this pointer to work
> >>>>> without messing a lot with IOVATree.
> >>>>> memory_region_find -> I'm totally unable to make it return sections
> >>>>> that make sense.
> >>>>> flatview_for_each_range -> It does not return the same
> >>>>> MemoryRegionSection as the listener, not sure why.
> >>>> Ouch, thank you for the summary of attempts that were done earlier.
> >>>>> The only advance I have is that memory_region_from_host is able to
> >>>>> tell if the vaddr is from the guest or not.
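A minimal sketch of that discrimination, assuming the caller only needs
a guest-vs-host answer (memory_region_from_host() is the real API; the
helper name is made up):

/* Made-up helper: returns true if vaddr belongs to guest memory. */
static bool vaddr_is_guest_memory(void *vaddr)
{
    ram_addr_t offset;
    MemoryRegion *mr = memory_region_from_host(vaddr, &offset);

    /* NULL for host-only allocations such as the SVQ vrings */
    return mr != NULL;
}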
> >>>> Hmmm, then it won't be too useful without a direct means of identifying
> >>>> the exact memory region associated with the iova that is being mapped.
> >>>> And this additional indirection seems to introduce a tiny bit more
> >>>> latency in the reverse lookup routine (it should not be a scalability
> >>>> issue though if it's a linear search)?
> >>>>
> >>> I didn't measure, but I guess yes, it might. OTOH these structs may be
> >>> cached because virtqueue_pop just looked for them.
> >> Oh, right, that's a good point.
> >>>>> So I'm convinced there must be a way to do it with the memory
> >>>>> subsystem, but I think the best way to do it ATM is to store a
> >>>>> parallel tree with GPA->SVQ IOVA translations. At removal time, if we
> >>>>> find the entry in this new tree, we can directly remove it by GPA. If
> >>>>> not, assume it is a host-only address like SVQ vrings, and remove by
> >>>>> iterating on vaddr as we do now.
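The removal path of that idea might look roughly like this (hypothetical
sketch: VhostIOVATree does not currently have iova_map / gpa_map members
with these names, and the helper is made up):

static void vhost_iova_tree_remove_by_gpa(VhostIOVATree *tree, DMAMap map)
{
    /* For guest-backed mappings, map.translated_addr carries the GPA */
    const DMAMap *entry = iova_tree_find_iova(tree->gpa_map, &map);

    if (entry) {
        DMAMap found = *entry;    /* copy before removal frees the node */

        /* found.iova is the SVQ IOVA, which also indexes the main tree */
        iova_tree_remove(tree->gpa_map, found);
        iova_tree_remove(tree->iova_map, found);
        return;
    }

    /* Host-only address (e.g. SVQ vrings): iterate on vaddr as today */
    entry = iova_tree_find_iova(tree->iova_map, &map);
    if (entry) {
        iova_tree_remove(tree->iova_map, *entry);
    }
}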
> >>>> Yeah, this could work I think. On the other hand, given that we are
> >>>> now trying to improve it, I wonder if it's possible to come up with a
> >>>> fast version for the SVQ (host-only address) case without having to
> >>>> look up twice? SVQ callers should be able to tell themselves apart
> >>>> from the guest case where the GPA -> IOVA translation doesn't exist?
> >>>> Or just maintain a parallel tree with HVA -> IOVA translations for SVQ
> >>>> reverse lookup only? I feel SVQ mappings may be worth a separate fast
> >>>> lookup path - unlike guest mappings, the insertion, lookup and removal
> >>>> for SVQ mappings seem unavoidable during the migration downtime path.
> >>>>
> >>> I think the ideal order is the opposite actually. So:
> >>> 1) Try for the NIC to support _F_VRING_ASID, no translation needed by QEMU
> >> Right, that's the case for _F_VRING_ASID, which is simple and easy to
> >> deal with. Though I think this is an edge case across all vendor
> >> devices, as most likely only those no-chip IOMMU parents may support it.
> >> It's a luxury for a normal device to steal another VF for this ASID
> >> feature...
> >>
> >>> 2) Try reverse lookup from HVA to GPA. Since dataplane should fit
> >>> this, we should test this first
> >> So instead of a direct lookup from HVA to IOVA, the purpose of the extra
> >> reverse lookup from HVA to GPA is to verify the validity of the GPA (to
> >> avoid it being mistakenly picked from an overlapped region)? But this
> >> would seem to require scanning the entire GPA space to identify possible
> >> GPA ranges that are potentially overlapped? I wonder if there is a way
> >> to simplify this assumption: could we go through this extra layer of
> >> GPA-wide scan and validation *only* when an overlap is indeed detected
> >> during the memory listener's region_add (say, during which we try to
> >> insert a duplicate / overlapped HVA into the HVA -> IOVA tree)? Or,
> >> simply put, the first match on the reverse lookup would mostly suffice,
> >> since we know the virtio driver can't use guest memory from these
> >> overlapped regions?
> > The first match should be enough, but maybe we need more than one
> > match. Let me put an example:
> >
> > The buffer is (vaddr = 0x1000, size=0x3000). Now the tree contains two
> > overlapped entries: (vaddr=0x1000, size=0x2000) and (vaddr=0x1000,
> > size=0x3000).
> In this case, assume the overlap can be detected via certain structs,
> e.g. a HVA->IOVA reverse tree; then a full and slow lookup needs to be
> performed. Here we can try to match using the size, but I feel it's
> best to identify the exact IOVA range by the GPA. This can be done
> through a tree storing the GPA->HVA mappings, and the reverse lookup
> from HVA->GPA will help identify if the HVA falls into a certain GPA
> range.
>
It is possible somehow, but multiple searches are already used in other
areas where the full range is not found in the first attempt. The first
one may return a partial result, but the second one can look for the
missing part of the key (vaddr=0x2000, size=0x1000). Isn't that simpler?
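Conceptually the multiple-search loop is short; an untested sketch (the
helper name is made up, and it sidesteps the id question by probing one
byte at a time):

static bool svq_translate_range(const IOVATree *tree, hwaddr addr, hwaddr size)
{
    while (size > 0) {
        const DMAMap needle = {
            .translated_addr = addr,
            .size = 0,    /* probe the first byte of the remainder */
        };
        const DMAMap *map = iova_tree_find_iova(tree, &needle);

        if (!map) {
            return false;    /* hole: no translation for this chunk */
        }

        /* Bytes this entry covers from addr on (DMAMap size is inclusive) */
        hwaddr off = addr - map->translated_addr;
        hwaddr len = MIN(size, map->size + 1 - off);

        /* ... use map->iova + off for this chunk ... */
        addr += len;
        size -= len;
    }
    return true;
}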
> >
> > Assuming we go through the reverse IOVA tree, we had bad luck and we
> > stored the small entry plus the big entry. The first search returns
> > the small entry then, (vaddr=0x1000, size=0x2000). Calling code must
> > detect it, and then look for vaddr = 0x1000 + 0x2000. That gives us
> > the next entry.
> Is there any reason why the first search can't pass in the GPA to
> further help identify? Suppose it's verified that the specific GPA
> range does exist via the HVA->GPA lookup.

The only problem is that IOVATree is shared with intel_iommu. How to
keep modifying it without affecting the IOVATree usage by intel_iommu
might be a problem.

> >
> > You can see that virtqueue_map_desc translates this way if
> > dma_memory_map returns a translation shorter than the length of the
> > buffer, for example.
> >
> >> You may say this assumption is too bold, but do we have other means to
> >> guarantee the first match will always hit under SVQ lookup? Given that
> >> we didn't receive an instance of this issue report until we moved the
> >> memory listener registration upfront to device initialization, I guess
> >> there should be some point, or certain conditions, under which the
> >> non-overlapped 1:1 translation and lookup can be satisfied. Don't you
> >> agree?
> >>
> > To be able to build the shorter one is desirable, yes. Maybe it can be
> > done in this series, but I find it hard to solve some situations. For
> > example, is it possible to have three overlapping regions (A, B, C)
> > where regions A and B do not overlap but C overlaps both of them?
> Does C map to a different GPA range than where regions A and B reside
> originally? The flattened guest view should guarantee that, right? Then
> it shouldn't be a problem by passing in the GPA as the secondary ID for
> the reverse HVA->IOVA lookup.

Right. But in this RFC the id is not searched in full range, only the
first GPA of each region.

> Regards,
> -Siwei
>
> > That's why I think it is better to delay that to a future series, but
> > we can do it in one shot if it is simple enough, for sure.
> >
> > Thanks!
> >
> >> Thanks,
> >> -Siwei
> >>> 3) Look in SVQ host-only entries (SVQ, current shadow CVQ). It is the
> >>> control VQ, speed is not so important.
> >>>
> >>> Overlapping regions may return the wrong SVQ IOVA though. We should
> >>> take extra care to make sure these are correctly handled. I mean,
> >>> there are valid translations in the tree unless the driver is buggy,
> >>> they just may need to span many translations.
> >>>
> >>>>> It is guaranteed the guest does not translate to that vaddr and that
> >>>>> that vaddr is unique in the tree anyway.
> >>>>>
> >>>>> Does it sound reasonable? Jonah, would you be interested in moving
> >>>>> this forward?
> >>>> My thought would be that the reverse IOVA tree stuff can be added as a
> >>>> follow-up optimization right after, for extended scalability, but for
> >>>> now as the interim we may still need some form of simple fix, so as to
> >>>> quickly unblock the other dependent work built on top of this one and
> >>>> the early pinning series [1]. With that said, I'm completely fine with
> >>>> performing the reverse lookup through a linear tree walk, e.g.
> >>>> g_tree_foreach(); that should suffice for small VM configs with just a
> >>>> couple of queues and a limited number of memory regions. Going
> >>>> forward, to address the scalability bottleneck, Jonah could just
> >>>> replace the corresponding API call with the one built on top of the
> >>>> reverse IOVA tree (I presume the use of these iova tree APIs is
> >>>> internal and limited to the SVQ and vhost-vdpa subsystems) once he
> >>>> gets there, and then eliminate the other API variants that will no
> >>>> longer be in use. What do you think about this idea / plan?
> >>>>
> >>> Yeah it makes sense to me. Hopefully we can even get rid of the id member.
> >>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>
> >>>> [1]
> >>>> https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> >>>>
> >>>>> Thanks!
> >>>>>
> >>>>>> -Siwei
> >>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -Siwei
> >>>>>>>>> Thanks!
> >>>>>>>>>>
> >>>>>>>>>> Of course, memory_region_from_host() won't search out of the
> >>>>>>>>>> guest memory space for sure. As this could be on the hot data
> >>>>>>>>>> path, I have a little bit of hesitance over the potential cost
> >>>>>>>>>> or performance regression this change could bring in, but maybe
> >>>>>>>>>> I'm overthinking it too much...
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> -Siwei
> >>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> -Siwei
> >>>>>>>>>>>>>          return false;
> >>>>>>>>>>>>>      }
> >>>>>>>>>>>>>
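For completeness, the insertion side of the parallel-tree idea discussed
above could mirror the removal sketch like this (again hypothetical:
gpa_map and the helper name do not exist today):

static int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map,
                                         hwaddr gpa)
{
    /* Allocate the SVQ IOVA in the main HVA-keyed tree, as today */
    int r = vhost_iova_tree_map_alloc(tree, map);
    if (r != IOVA_OK) {
        return r;
    }

    /* Mirror the entry keyed by GPA, reusing the allocated IOVA */
    DMAMap gpa_entry = *map;
    gpa_entry.translated_addr = gpa;
    return iova_tree_insert(tree->gpa_map, &gpa_entry);
}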