On Mon, Jul 29, 2024 at 7:50 PM Jonah Palmer <jonah.pal...@oracle.com> wrote:
>
> On 7/29/24 6:04 AM, Eugenio Perez Martin wrote:
> > On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer <jonah.pal...@oracle.com> wrote:
> >>
> >> On 5/13/24 11:56 PM, Jason Wang wrote:
> >>> On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>>>
> >>>> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>
> >>>>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>>>>>
> >>>>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>>>>>>>
> >>>>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <epere...@redhat.com> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where different GPAs lead to the same HVA. This causes a problem when overlapped regions (different GPA but same translated HVA) exist in the tree, as looking them up by HVA will return them twice.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I think I don't understand if there's any side effect for shadow virtqueue?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this comes from.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Si-Wei found that during initialization this sequence of maps / unmaps happens [1]:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> HVA                                 GPA                          IOVA
> >>>>>>>>>>>>>>>> ------------------------------------------------------------------------------------------------
> >>>>>>>>>>>>>>>> Map
> >>>>>>>>>>>>>>>> [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000)            [0x1000, 0x80000000)
> >>>>>>>>>>>>>>>> [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)  [0x80001000, 0x2000001000)
> >>>>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)     [0x2000001000, 0x2000021000)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Unmap
> >>>>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)     [0x1000, 0x20000) ???
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The third HVA range is contained in the first one, but exposed under a different GVA (aliased). This is not "flattened" by QEMU, as GPA does not overlap, only HVA.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> At the third chunk unmap, the current algorithm finds the first chunk, not the second one. This series is the way to tell the difference at unmap time.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in the iova tree to solve this issue completely. Then there won't be aliasing issues.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm ok to explore that route but this has another problem. Both SVQ vrings and CVQ buffers also need to be addressable by VhostIOVATree, and they do not have GPA.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> At this moment vhost_svq_translate_addr is able to handle this transparently as we translate vaddr to SVQ IOVA. How can we store these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and then a list to go through other entries (SVQ vaddr and CVQ buffers).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This seems to be tricky.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As discussed, it could be another iova tree.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes but there are many ways to add another IOVATree. Let me expand & recap.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Option 1 is to simply add another iova tree to VhostShadowVirtqueue. Let's call it gpa_iova_tree, as opposed to the current iova_tree that translates from vaddr to SVQ IOVA. To know which one to use is easy at adding or removing, like in the memory listener, but how to know at vhost_svq_translate_addr?
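For illustration only, here is a minimal sketch of the lookup ambiguity described in the trace above, written against QEMU's util/iova-tree.h helpers (DMAMap, iova_tree_insert(), iova_tree_find_iova()). The function name and the exact sizes are assumptions taken from the trace, not code from the series:

    /*
     * Sketch only: two entries whose IOVA ranges do not overlap, but whose
     * HVA ranges do (the third map of the trace is contained in the first).
     * iova_tree_insert() only checks IOVA overlaps, so both inserts succeed.
     */
    #include "qemu/osdep.h"
    #include "qemu/iova-tree.h"

    static void aliasing_example(void)
    {
        IOVATree *map = iova_tree_new();
        const DMAMap first = {
            .iova = 0x1000,
            .translated_addr = 0x7f7903e00000, /* HVA of GPA [0x0, 0x80000000) */
            .size = 0x80000000 - 1,            /* DMAMap sizes are inclusive */
            .perm = IOMMU_RW,
        };
        const DMAMap third = {
            .iova = 0x2000001000,
            .translated_addr = 0x7f7903ea0000, /* HVA of GPA [0xfeda0000, 0xfedc0000) */
            .size = 0x20000 - 1,
            .perm = IOMMU_RW,
        };
        iova_tree_insert(map, &first);
        iova_tree_insert(map, &third);

        /* Reverse (HVA -> IOVA) lookup, as done at unmap time */
        const DMAMap needle = {
            .translated_addr = 0x7f7903ea0000,
            .size = 0x20000 - 1,
        };
        const DMAMap *found = iova_tree_find_iova(map, &needle);
        /*
         * Both entries overlap the needle's HVA range, so 'found' can be
         * 'first', and the unmap is then issued against the wrong IOVA
         * (the [0x1000, ...) range instead of [0x2000001000, ...)).
         */
        (void)found;
        iova_tree_destroy(map);
    }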
> >>>>>>>>>>>
> >>>>>>>>>>> Then we won't use virtqueue_pop() at all, we need a SVQ version of virtqueue_pop() to translate GPA to SVQ IOVA directly?
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The problem is not virtqueue_pop, that's out of the vhost_svq_translate_addr. The problem is the need to add conditionals / complexity in all the callers of
> >>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> The easiest way for me is to rely on memory_region_from_host(). When vaddr is from the guest, it returns a valid MemoryRegion. When it is not, it returns NULL. I'm not sure if this is a valid use case, it just worked in my tests so far.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Now we have the second problem: the GPA values of the regions of the two IOVA trees must be unique. We need to be able to find unallocated regions in SVQ IOVA. At this moment there is only one IOVATree, so this is done easily by vhost_iova_tree_map_alloc. But it is very complicated with two trees.
> >>>>>>>>>>>
> >>>>>>>>>>> Would it be simpler if we decouple the IOVA allocator? For example, we can have a dedicated gtree to track the allocated IOVA ranges. It is shared by both
> >>>>>>>>>>>
> >>>>>>>>>>> 1) Guest memory (GPA)
> >>>>>>>>>>> 2) SVQ virtqueue and buffers
> >>>>>>>>>>>
> >>>>>>>>>>> And another gtree to track the GPA to IOVA.
> >>>>>>>>>>>
> >>>>>>>>>>> The SVQ code could use either
> >>>>>>>>>>>
> >>>>>>>>>>> 1) one linear mapping that contains both SVQ virtqueue and buffers
> >>>>>>>>>>>
> >>>>>>>>>>> or
> >>>>>>>>>>>
> >>>>>>>>>>> 2) dynamic IOVA allocation/deallocation helpers
> >>>>>>>>>>>
> >>>>>>>>>>> So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> That's possible, but that scatters the IOVA handling code instead of keeping it self-contained in VhostIOVATree.
> >>>>>>>>>
> >>>>>>>>> To me, the IOVA range/allocation is orthogonal to how IOVA is used.
> >>>>>>>>>
> >>>>>>>>> An example is the iova allocator in the kernel.
> >>>>>>>>>
> >>>>>>>>> Note that there's an even simpler IOVA "allocator" in NVME passthrough code, not sure it is useful here though (haven't had a deep look at that).
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I don't know enough about them to have an opinion. I keep seeing the drawback of needing to synchronize both allocation & adding in all the places we want to modify the IOVATree. At this moment, these are the vhost-vdpa memory listener, the SVQ vring creation and removal, and net CVQ buffers. But it may be more in the future.
> >>>>>>>>
> >>>>>>>> What are the advantages of keeping these separated that justify needing to synchronize in all these places, compared with keeping them synchronized in VhostIOVATree?
> >>>>>>>
> >>>>>>> It doesn't need to be synchronized.
> >>>>>>>
> >>>>>>> Assuming guest and SVQ share the IOVA range, IOVA only needs to track which part of the range has been used.
> >>>>>>>
> >>>>>>
> >>>>>> Not sure if I follow, that is what I mean with "synchronized".
> >>>>>
> >>>>> Oh right.
> >>>>>
> >>>>>>
> >>>>>>> This simplifies things, we can store GPA->IOVA mappings and SVQ -> IOVA mappings separately.
> >>>>>>>
> >>>>>>
> >>>>>> Sorry, I still cannot see the whole picture :).
> >>>>>>
> >>>>>> Let's assume we have all the GPA mapped to specific IOVA regions, so we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ because of the migration. How can we know where we can place SVQ vrings without having them synchronized?
> >>>>>
> >>>>> Just allocating a new IOVA range for SVQ?
> >>>>>
> >>>>>>
> >>>>>> At this moment we're using a tree. The tree nature of the current SVQ IOVA -> VA makes all nodes ordered so it is more or less easy to look for holes.
> >>>>>
> >>>>> Yes, iova allocation could still be implemented via a tree.
> >>>>>
> >>>>>>
> >>>>>> Now your proposal uses the SVQ IOVA as tree values. Should we iterate over all of them in the two trees, order them, and then look for holes there?
> >>>>>
> >>>>> Let me clarify, correct me if I was wrong:
> >>>>>
> >>>>> 1) IOVA allocator is still implemented via a tree, we just don't need to store how the IOVA is used
> >>>>> 2) A dedicated GPA -> IOVA tree, updated via listeners and used in the datapath SVQ translation
> >>>>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ
> >>>>>
> >>>>
> >>>> Ok, so the part I was missing is that now we have 3 whole trees, with somehow redundant information :).
> >>>
> >>> Somehow, it decouples the IOVA usage out of the IOVA allocator. This might be simple as guests and SVQ may try to share a single IOVA address space.
> >>>
> >>
> >> I'm working on implementing your three suggestions above but I'm a bit confused with some of the wording and I was hoping you could clarify some of it for me when you get the chance.
> >>
> >> ---
> >> For your first suggestion (1) you mention decoupling the IOVA allocator and "don't need to store how the IOVA is used."
> >>
> >> By this, do you mean to not save the IOVA->HVA mapping and instead only save the allocated IOVA ranges? In other words, are you suggesting to create a dedicated "IOVA->IOVA" tree like:
> >>
> >> struct VhostIOVATree {
> >>     uint64_t iova_first;
> >>     uint64_t iova_last;
> >>     IOVATree *iova_map;
> >> };
> >>
> >> Where the mapping might look something like (where translated_addr is given some kind of 0 value):
> >>
> >> iova_region = (DMAMap) {
> >>     .iova = iova_first,
> >>     .translated_addr = 0,
> >>     .size = region_size - 1,
> >>     .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >> };
> >>
> >> Also, if this is what you mean by decoupling the IOVA allocator, what happens to the IOVA->HVA tree? Are we no longer saving these mappings in a tree?
> >>
> >> ---
> >> In your second suggestion (2) with a dedicated GPA->IOVA tree, were you thinking something like this? Just adding on to VhostIOVATree here:
> >>
> >> struct VhostIOVATree {
> >>     uint64_t iova_first;
> >>     uint64_t iova_last;
> >>     IOVATree *iova_map;
> >>     IOVATree *gpa_map;
> >> };
> >>
> >> Where the mapping might look something like:
> >>
> >> gpa_region = (DMAMap) {
> >>     .iova = iova_first,
> >>     .translated_addr = gpa_first,
> >>     .size = region_size - 1,
> >>     .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >> };
> >>
> >> Also, when you say "used in the datapath SVQ translation", we still need to build the GPA->IOVA tree when vhost_vdpa_listener_region_add() is called, right?
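For what it's worth, one possible reading of suggestion (1) above in code form: an allocation-only tree where translated_addr is never consulted. This is a sketch assuming util/iova-tree.h's iova_tree_alloc_map() and its IOVA_OK return code; the helper name and parameters are illustrative, not part of any posted patch:

    #include "qemu/osdep.h"
    #include "qemu/iova-tree.h"

    /*
     * Sketch: the tree only records which IOVA ranges are in use so that
     * iova_tree_alloc_map() can find a free hole; translated_addr is never
     * read back, so this tree does not translate anything by itself.
     */
    static int iova_only_alloc(IOVATree *iova_map, hwaddr len,
                               hwaddr iova_first, hwaddr iova_last,
                               hwaddr *iova_out)
    {
        DMAMap alloc = {
            .translated_addr = 0,   /* unused by this tree */
            .size = len - 1,        /* DMAMap sizes are inclusive */
            .perm = IOMMU_RW,
        };
        int r = iova_tree_alloc_map(iova_map, &alloc, iova_first, iova_last);

        if (r != IOVA_OK) {
            return r;
        }
        *iova_out = alloc.iova;     /* the range picked by the allocator */
        return IOVA_OK;
    }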
> >>
> >> ---
> >> Lastly, in your third suggestion (3) you mention implementing a SVQ->IOVA tree, making the total number of IOVATrees/GTrees 3: one for just IOVAs, one for GPA->IOVA, and the last one for SVQ->IOVA. E.g.
> >>
> >> struct VhostIOVATree {
> >>     uint64_t iova_first;
> >>     uint64_t iova_last;
> >>     IOVATree *iova_map;
> >>     IOVATree *gpa_map;
> >>     IOVATree *svq_map;
> >> };
> >>
> >> ---
> >>
> >> Let me know if I'm understanding this correctly. If I am, this would require a pretty good amount of refactoring on the IOVA allocation, searching, destroying, etc. code to account for these new trees.
> >>
> >
> > Ok, I think I understand your previous question better, sorry for the delay :).
> >
> > If I'm not wrong, Jason did not enumerate these as alternatives but as part of the same solution. Jason, please correct me if I'm wrong here.
> >
> > His solution is composed of three trees:
> > 1) One for the IOVA allocations, so we know where to allocate new ranges
> > 2) One for the GPA -> SVQ IOVA translations.
> > 3) Another one for SVQ vring translations.
> >
> > In my opinion using GPA is problematic, as we force all of the callers to know if the address we want to translate comes from the guest or not. Current code does not know it, as vhost_svq_translate_addr is called to translate both dataplane buffer addresses and control virtqueue buffers, which are also shadowed. Transmitting that information from the caller code would require either duplicated functions, an extra boolean, etc., as it lives in a different layer (net-specific net/vhost-vdpa.c vs generic hw/virtio/vhost-shadow-virtqueue.c).
> >
> > In my opinion it is easier to keep just two trees (or similar structs):
> > 1) One for the IOVA allocations, so we know where to allocate new ranges. We don't actually need to store the translations here.
> > 2) One for HVA -> SVQ IOVA translations.
> >
> > This way we can accommodate SVQ vrings, CVQ buffers, and guest memory buffers, all in the second tree, and take care of the HVA duplications.
> >
> > Thanks!
>
> I assume that this dedicated IOVA tree will be created and added to in the vhost_iova_tree_map_alloc --> iova_tree_alloc_map function calls, but what about the IOVA->HVA tree that gets created during vhost_vdpa_listener_region_add?
Not sure if I get you. The only IOVA tree that vdpa is using is created at either net/vhost-vdpa.c:vhost_vdpa_net_data_start_first or vhost_vdpa_net_cvq_start. From that moment on, other places like vhost_vdpa_listener_region_add just add entries to it.

> Will this tree just be replaced with the dedicated IOVA tree?
>

I'd say that for a first iteration it is ok to keep using VhostIOVATree->IOVATree, even if the values of the tree (HVA) are not used anymore.

> Also, an HVA->SVQ IOVA tree means that the tree is balanced using the HVA value and not the IOVA value, right? In other words, a tree where it's more efficient to search using the HVA values vs. IOVA values?
>

Right, HVA is the key and SVQ IOVA is the value. That way we can detect duplicates and act accordingly.

> >
> >> Thanks Jason!
> >>
> >>>>
> >>>> In some sense this is simpler than trying to get all the information from only two trees. On the bad side, all SVQ calls that allocate some region need to remember to add to one of the two other trees. That is what I mean by synchronized. But sure, we can go that way.
> >>>
> >>> Just a suggestion, if it turns out to complicate the issue, I'm fine to go the other way.
> >>>
> >>> Thanks
> >>>
> >>>>
> >>>>> Thanks
> >>>>>
> >>>>>>
> >>>>>>> Thanks
> >>>>>>>
> >>>>>>>> Thanks!
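To make the two-structure idea concrete, here is a rough sketch (not the actual implementation, with illustrative names such as taddr_map and vhost_iova_tree_map_alloc_sketch) of how VhostIOVATree could look, reusing util/iova-tree.h for both parts. The duplicate handling shown, reusing the existing SVQ IOVA for an already-mapped HVA range, is just one possible way to "act accordingly"; a real version would likely key the second structure by HVA for cheaper lookups:

    #include "qemu/osdep.h"
    #include "qemu/iova-tree.h"

    typedef struct VhostIOVATree {
        hwaddr iova_first;
        hwaddr iova_last;
        IOVATree *iova_map;  /* allocator only: which SVQ IOVA ranges are used */
        IOVATree *taddr_map; /* HVA -> SVQ IOVA: guest memory, SVQ vrings, CVQ buffers */
    } VhostIOVATree;

    /* Caller fills map->translated_addr (HVA), map->size and map->perm */
    static int vhost_iova_tree_map_alloc_sketch(VhostIOVATree *tree, DMAMap *map)
    {
        /*
         * HVA duplicates (aliased regions): reuse the SVQ IOVA that already
         * covers this HVA range.  Partial overlaps are not handled here.
         */
        const DMAMap *dup = iova_tree_find_iova(tree->taddr_map, map);
        if (dup) {
            map->iova = dup->iova + (map->translated_addr - dup->translated_addr);
            return IOVA_OK;
        }

        /* Allocate a free SVQ IOVA range ... */
        int r = iova_tree_alloc_map(tree->iova_map, map, tree->iova_first,
                                    tree->iova_last);
        if (r != IOVA_OK) {
            return r;
        }

        /* ... and record the HVA -> SVQ IOVA translation for the datapath */
        return iova_tree_insert(tree->taddr_map, map);
    }

Unmapping, permission checks and partially overlapping HVA ranges are deliberately left out of this sketch.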