On Mon, Jul 29, 2024 at 6:05 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
>
> On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer <jonah.pal...@oracle.com> wrote:
> >
> > On 5/13/24 11:56 PM, Jason Wang wrote:
> > > On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin
> > > <epere...@redhat.com> wrote:
> > >>
> > >> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasow...@redhat.com> wrote:
> > >>>
> > >>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
> > >>> <epere...@redhat.com> wrote:
> > >>>>
> > >>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>
> > >>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> > >>>>> <epere...@redhat.com> wrote:
> > >>>>>>
> > >>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>
> > >>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin
> > >>>>>>> <epere...@redhat.com> wrote:
> > >>>>>>>>
> > >>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin
> > >>>>>>>>> <epere...@redhat.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin
> > >>>>>>>>>>> <epere...@redhat.com> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > >>>>>>>>>>>>> <epere...@redhat.com> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <epere...@redhat.com> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where different GPA leads
> > >>>>>>>>>>>>>>>> to the same HVA. This causes a problem when overlapped regions
> > >>>>>>>>>>>>>>>> (different GPA but same translated HVA) exist in the tree, as looking
> > >>>>>>>>>>>>>>>> them up by HVA will return them twice.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I think I don't understand if there's any side effect for
> > >>>>>>>>>>>>>>> shadow virtqueue?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this comes from.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Si-Wei found that during initialization this sequence of maps /
> > >>>>>>>>>>>>>> unmaps happens [1]:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> HVA                                  GPA                          IOVA
> > >>>>>>>>>>>>>> -------------------------------------------------------------------------------------------------------------------------
> > >>>>>>>>>>>>>> Map
> > >>>>>>>>>>>>>> [0x7f7903e00000, 0x7f7983e00000)     [0x0, 0x80000000)            [0x1000, 0x80000000)
> > >>>>>>>>>>>>>> [0x7f7983e00000, 0x7f9903e00000)     [0x100000000, 0x2080000000)  [0x80001000, 0x2000001000)
> > >>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)     [0x2000001000, 0x2000021000)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Unmap
> > >>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)     [0x1000, 0x20000) ???
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The third HVA range is contained in the first one, but exposed under a
> > >>>>>>>>>>>>>> different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > >>>>>>>>>>>>>> not overlap, only HVA.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> At the third chunk unmap, the current algorithm finds the first chunk,
> > >>>>>>>>>>>>>> not the second one. This series is the way to tell the difference at
> > >>>>>>>>>>>>>> unmap time.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> [1] https://urldefense.com/v3/__https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html__;!!ACWV5N9M2RV99hQ!MXbGSFHVbqRf0rzyWamOdnBLHP0FUh3r3BnTvGe6Mn5VzXTsajVp3BB7VqlklkRCr5aKazC5xxTCScuR_BY$
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Ok, I was wondering if we need to store GPA (GIOVA) to HVA mappings in
> > >>>>>>>>>>>>> the iova tree to solve this issue completely. Then there won't be
> > >>>>>>>>>>>>> aliasing issues.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'm ok to explore that route but this has another problem. Both SVQ
> > >>>>>>>>>>>> vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > >>>>>>>>>>>> and they do not have a GPA.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> At this moment vhost_svq_translate_addr is able to handle this
> > >>>>>>>>>>>> transparently as we translate vaddr to SVQ IOVA. How can we store
> > >>>>>>>>>>>> these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > >>>>>>>>>>>> then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > >>>>>>>>>>>
> > >>>>>>>>>>> This seems to be tricky.
> > >>>>>>>>>>>
> > >>>>>>>>>>> As discussed, it could be another iova tree.
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Yes, but there are many ways to add another IOVATree. Let me expand & recap.
> > >>>>>>>>>>
> > >>>>>>>>>> Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > >>>>>>>>>> Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > >>>>>>>>>> translates from vaddr to SVQ IOVA. Knowing which one to use is easy at
> > >>>>>>>>>> adding or removing time, like in the memory listener, but how do we
> > >>>>>>>>>> know at vhost_svq_translate_addr?
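To make the aliasing above concrete, here is a minimal, self-contained C sketch. It is a toy model, not QEMU's IOVATree code; the ranges come from Si-Wei's trace, with sizes approximated by the HVA spans. It shows why a lookup keyed on HVA alone hits the first RAM region instead of the aliased one, while matching on HVA plus GPA picks the entry the unmap actually refers to:

    /*
     * Toy illustration of the aliasing problem, not QEMU code. The three
     * "Map" entries come from the trace above; sizes follow the HVA spans.
     */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t hva, gpa, iova, size;
    } ToyMap;

    static const ToyMap maps[] = {
        { 0x7f7903e00000ULL, 0x000000000ULL, 0x0000001000ULL, 0x0080000000ULL },
        { 0x7f7983e00000ULL, 0x100000000ULL, 0x0080001000ULL, 0x1f80000000ULL },
        { 0x7f7903ea0000ULL, 0x0feda0000ULL, 0x2000001000ULL, 0x0000020000ULL },
    };

    /* Current behaviour: the first entry whose HVA range covers the address wins. */
    static const ToyMap *find_by_hva(uint64_t hva)
    {
        for (size_t i = 0; i < sizeof(maps) / sizeof(maps[0]); i++) {
            if (hva >= maps[i].hva && hva < maps[i].hva + maps[i].size) {
                return &maps[i];
            }
        }
        return NULL;
    }

    /* What the unmap path needs: use the GPA to disambiguate aliased HVAs. */
    static const ToyMap *find_by_hva_gpa(uint64_t hva, uint64_t gpa)
    {
        for (size_t i = 0; i < sizeof(maps) / sizeof(maps[0]); i++) {
            if (hva >= maps[i].hva && hva < maps[i].hva + maps[i].size &&
                gpa >= maps[i].gpa && gpa < maps[i].gpa + maps[i].size) {
                return &maps[i];
            }
        }
        return NULL;
    }

    int main(void)
    {
        uint64_t hva = 0x7f7903ea0000ULL;  /* start of the aliased third chunk */
        uint64_t gpa = 0xfeda0000ULL;

        /* Prints IOVA 0x1000: the big RAM region, i.e. the wrong entry. */
        printf("by HVA only:    IOVA 0x%" PRIx64 "\n", find_by_hva(hva)->iova);
        /* Prints IOVA 0x2000001000: the aliased region the unmap refers to. */
        printf("by HVA and GPA: IOVA 0x%" PRIx64 "\n", find_by_hva_gpa(hva, gpa)->iova);
        return 0;
    }

This is the ambiguity the series needs to resolve at unmap time.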
> > >>>>>>>>>
> > >>>>>>>>> Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > >>>>>>>>> virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> The problem is not virtqueue_pop, that's outside of
> > >>>>>>>> vhost_svq_translate_addr. The problem is the need to add
> > >>>>>>>> conditionals / complexity in all the callers of
> > >>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> The easiest way for me is to rely on memory_region_from_host(). When
> > >>>>>>>>>> vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > >>>>>>>>>> not, it returns NULL. I'm not sure if this is a valid use case, it
> > >>>>>>>>>> just worked in my tests so far.
> > >>>>>>>>>>
> > >>>>>>>>>> Now we have the second problem: the GPA values of the regions of the
> > >>>>>>>>>> two IOVA trees must be unique. We need to be able to find unallocated
> > >>>>>>>>>> regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > >>>>>>>>>> this is done easily by vhost_iova_tree_map_alloc. But it is very
> > >>>>>>>>>> complicated with two trees.
> > >>>>>>>>>
> > >>>>>>>>> Would it be simpler if we decouple the IOVA allocator? For example, we
> > >>>>>>>>> can have a dedicated gtree to track the allocated IOVA ranges. It is
> > >>>>>>>>> shared by both
> > >>>>>>>>>
> > >>>>>>>>> 1) Guest memory (GPA)
> > >>>>>>>>> 2) SVQ virtqueue and buffers
> > >>>>>>>>>
> > >>>>>>>>> And another gtree to track the GPA to IOVA.
> > >>>>>>>>>
> > >>>>>>>>> The SVQ code could use either
> > >>>>>>>>>
> > >>>>>>>>> 1) one linear mapping that contains both SVQ virtqueue and buffers
> > >>>>>>>>>
> > >>>>>>>>> or
> > >>>>>>>>>
> > >>>>>>>>> 2) dynamic IOVA allocation/deallocation helpers
> > >>>>>>>>>
> > >>>>>>>>> So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> That's possible, but that scatters the IOVA handling code instead of
> > >>>>>>>> keeping it self-contained in VhostIOVATree.
> > >>>>>>>
> > >>>>>>> To me, the IOVA range/allocation is orthogonal to how the IOVA is used.
> > >>>>>>>
> > >>>>>>> An example is the iova allocator in the kernel.
> > >>>>>>>
> > >>>>>>> Note that there's an even simpler IOVA "allocator" in the NVMe
> > >>>>>>> passthrough code, not sure it is useful here though (haven't had a
> > >>>>>>> deep look at that).
> > >>>>>>>
> > >>>>>>
> > >>>>>> I don't know enough about them to have an opinion. I keep seeing the
> > >>>>>> drawback of needing to synchronize both allocation & adding in all the
> > >>>>>> places we want to modify the IOVATree. At this moment, these are the
> > >>>>>> vhost-vdpa memory listener, the SVQ vring creation and removal, and
> > >>>>>> net CVQ buffers. But it may be more in the future.
> > >>>>>>
> > >>>>>> What are the advantages of keeping these separated that justify
> > >>>>>> needing to synchronize in all these places, compared with keeping them
> > >>>>>> synchronized in VhostIOVATree?
> > >>>>>
> > >>>>> It doesn't need to be synchronized.
> > >>>>>
> > >>>>> Assuming guest and SVQ share the IOVA range, the IOVA allocator only
> > >>>>> needs to track which part of the range has been used.
> > >>>>>
> > >>>>
> > >>>> Not sure if I follow, that is what I mean by "synchronized".
> > >>>
> > >>> Oh right.
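As a concrete reading of "only needs to track which part of the range has been used", here is a minimal sketch under that assumption. It is hypothetical, plain C rather than QEMU's GTree-backed IOVATree, with invented names: the allocator only records which ranges are in use and hands out holes, without knowing whether a range backs guest memory or SVQ vrings.

    /*
     * Hypothetical sketch, not QEMU code: an IOVA allocator that only
     * remembers which [iova, iova + size) ranges are in use and can find a
     * free hole, knowing nothing about what each range maps to.
     */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct UsedRange {
        uint64_t iova;
        uint64_t size;
        struct UsedRange *next;        /* list kept sorted by iova */
    } UsedRange;

    typedef struct {
        uint64_t first, last;          /* usable IOVA range, inclusive */
        UsedRange *used;
    } IovaAllocator;

    /* First fit: find a hole of @size, record it as used, return its start. */
    static int iova_alloc(IovaAllocator *a, uint64_t size, uint64_t *out)
    {
        uint64_t hole_start = a->first;
        UsedRange **prev = &a->used;

        for (UsedRange *r = a->used; ; prev = &r->next, r = r->next) {
            uint64_t hole_end = r ? r->iova : a->last + 1;

            if (hole_end - hole_start >= size) {
                UsedRange *n = malloc(sizeof(*n));
                n->iova = hole_start;
                n->size = size;
                n->next = r;
                *prev = n;
                *out = hole_start;
                return 0;
            }
            if (!r) {
                return -1;             /* no hole big enough */
            }
            hole_start = r->iova + r->size;
        }
    }

    int main(void)
    {
        IovaAllocator a = { .first = 0x1000, .last = 0x7fffffffffULL, .used = NULL };
        uint64_t guest_iova, svq_iova;

        /* Guest memory and the SVQ vring draw from the same allocator... */
        iova_alloc(&a, 0x80000000ULL, &guest_iova);
        iova_alloc(&a, 0x4000, &svq_iova);
        /* ...so their ranges never collide: 0x1000 and 0x80001000 here. */
        printf("guest @ 0x%" PRIx64 ", svq vring @ 0x%" PRIx64 "\n",
               guest_iova, svq_iova);
        return 0;
    }

How each allocated range is used (GPA -> IOVA, HVA -> IOVA) would then live in separate trees, which is the decoupling being discussed.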
> > >>>
> > >>>>
> > >>>>> This simplifies things, we can store GPA -> IOVA mappings and SVQ ->
> > >>>>> IOVA mappings separately.
> > >>>>>
> > >>>>
> > >>>> Sorry, I still cannot see the whole picture :).
> > >>>>
> > >>>> Let's assume we have all the GPA mapped to specific IOVA regions, so
> > >>>> we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ
> > >>>> because of the migration. How can we know where we can place the SVQ
> > >>>> vrings without having them synchronized?
> > >>>
> > >>> Just allocating a new IOVA range for SVQ?
> > >>>
> > >>>>
> > >>>> At this moment we're using a tree. The tree nature of the current SVQ
> > >>>> IOVA -> VA makes all nodes ordered, so it is more or less easy to look
> > >>>> for holes.
> > >>>
> > >>> Yes, IOVA allocation could still be implemented via a tree.
> > >>>
> > >>>>
> > >>>> Now your proposal uses the SVQ IOVA as tree values. Should we iterate
> > >>>> over all the nodes of the two trees, order them, and then look for
> > >>>> holes there?
> > >>>
> > >>> Let me clarify, correct me if I'm wrong:
> > >>>
> > >>> 1) The IOVA allocator is still implemented via a tree, we just don't
> > >>> need to store how the IOVA is used
> > >>> 2) A dedicated GPA -> IOVA tree, updated via listeners and used in
> > >>> the datapath SVQ translation
> > >>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ
> > >>>
> > >>
> > >> Ok, so the part I was missing is that now we have 3 whole trees, with
> > >> somewhat redundant information :).
> > >
> > > Somehow, it decouples the IOVA usage from the IOVA allocator. This
> > > might be simple, as guests and SVQ may try to share a single IOVA
> > > address space.
> > >
> >
> > I'm working on implementing your three suggestions above but I'm a bit
> > confused by some of the wording and I was hoping you could clarify
> > some of it for me when you get the chance.
> >
> > ---
> > For your first suggestion (1) you mention decoupling the IOVA allocator
> > and "don't need to store how the IOVA is used."
> >
> > By this, do you mean to not save the IOVA->HVA mapping and instead only
> > save the allocated IOVA ranges?
Yes.

> > In other words, are you suggesting to
> > create a dedicated "IOVA->IOVA" tree like:
> >
> >     struct VhostIOVATree {
> >         uint64_t iova_first;
> >         uint64_t iova_last;
> >         IOVATree *iova_map;
> >     };

It could be this or something else. I think the point is that the allocator is
only used for IOVA allocation but it doesn't have any information about the
mapping.

> >
> > Where the mapping might look something like (where translated_addr is
> > given some kind of 0 value):
> >
> >     iova_region = (DMAMap) {
> >         .iova = iova_first,
> >         .translated_addr = 0,
> >         .size = region_size - 1,
> >         .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >     };
> >
> > Also, if this is what you mean by decoupling the IOVA allocator, what
> > happens to the IOVA->HVA tree?

We will have two structures:

1) The IOVA domain (in charge of IOVA range allocation/deallocation)
2) How the IOVA is used, e.g. an IOVA->HVA tree

> > Are we no longer saving these mappings in
> > a tree?
> >
> > ---
> > In your second suggestion (2) with a dedicated GPA->IOVA tree, were you
> > thinking something like this? Just adding on to VhostIOVATree here:
> >
> >     struct VhostIOVATree {
> >         uint64_t iova_first;
> >         uint64_t iova_last;
> >         IOVATree *iova_map;
> >         IOVATree *gpa_map;
> >     };

I'd split the mappings if it's possible.

> >
> > Where the mapping might look something like:
> >
> >     gpa_region = (DMAMap) {
> >         .iova = iova_first,
> >         .translated_addr = gpa_first,
> >         .size = region_size - 1,
> >         .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >     };
> >
> > Also, when you say "used in the datapath SVQ translation", we still need
> > to build the GPA->IOVA tree when vhost_vdpa_listener_region_add() is
> > called, right?

Right.

> >
> > ---
> > Lastly, in your third suggestion (3) you mention implementing an
> > SVQ->IOVA tree, making the total number of IOVATrees/GTrees 3: one for
> > just IOVAs, one for GPA->IOVA, and the last one for SVQ->IOVA. E.g.
> >
> >     struct VhostIOVATree {
> >         uint64_t iova_first;
> >         uint64_t iova_last;
> >         IOVATree *iova_map;
> >         IOVATree *gpa_map;
> >         IOVATree *svq_map;
> >     };
> >
> > ---
> >
> > Let me know if I'm understanding this correctly. If I am, this would
> > require a pretty good amount of refactoring of the IOVA allocation,
> > searching, destroying, etc. code to account for these new trees.
>
> Ok, I think I understand your previous question better now, sorry for the delay :).
>
> If I'm not wrong, Jason did not enumerate these as alternatives but as
> parts of the same solution. Jason, please correct me if I'm wrong here.
>
> His solution is composed of three trees:
> 1) One for the IOVA allocations, so we know where to allocate new ranges
> 2) One for the GPA -> SVQ IOVA translations.
> 3) Another one for the SVQ vring translations.
>

Exactly.

> In my opinion using GPA is problematic, as we force all of the
> callers to know whether the address we want to translate comes from the
> guest or not. Current code does not know it, as
> vhost_svq_translate_addr is called to translate both dataplane buffer
> addresses and control virtqueue buffers, which are also shadowed.
> Transmitting that information to the caller code would require either
> duplicating functions, adding a boolean, etc., as it is in a different
> layer (net-specific net/vhost-vdpa.c vs generic
> hw/virtio/vhost-shadow-virtqueue.c).
>
> In my opinion it is easier to keep just two trees (or similar structs):
> 1) One for the IOVA allocations, so we know where to allocate new
> ranges. We don't actually need to store the translations here.
> 2) One for HVA -> SVQ IOVA translations.
>
> This way we can accommodate both SVQ vrings, CVQ buffers, and guest
> memory buffers, all on the second tree, and take care of the HVA
> duplications.

Probably, but how to take care of HVA duplications (I may miss the
context, sorry)?

Thanks

>
> Thanks!
>
> > Thanks Jason!
> >
> > >>
> > >> In some sense this is simpler than trying to get all the information
> > >> from only two trees. On the bad side, all SVQ calls that allocate some
> > >> region need to remember to add to one of the two other trees. That is
> > >> what I mean by synchronized. But sure, we can go that way.
> > >
> > > Just a suggestion, if it turns out to complicate the issue, I'm fine
> > > to go the other way.
> > >
> > > Thanks
> > >
> > >>
> > >>> Thanks
> > >>>
> > >>>>
> > >>>>> Thanks
> > >>>>>
> > >>>>>>
> > >>>>>> Thanks!
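For reference, a rough C sketch of the two layouts weighed in this thread. This is illustration only, not the actual QEMU structures: IOVATree is left opaque and the field names are invented.

    #include <stdint.h>

    typedef struct IOVATree IOVATree;   /* QEMU's GTree-backed map, opaque here */

    /* Jason's layout: a decoupled allocator plus two usage trees. */
    typedef struct VhostIOVATreeThree {
        uint64_t iova_first, iova_last;
        IOVATree *iova_allocator;   /* which IOVA ranges are taken; no targets */
        IOVATree *gpa_iova_map;     /* GPA -> SVQ IOVA, fed by the memory listener */
        IOVATree *svq_iova_map;     /* HVA -> SVQ IOVA for SVQ vrings / CVQ buffers */
    } VhostIOVATreeThree;

    /* Eugenio's preference: the allocator plus a single HVA -> SVQ IOVA tree. */
    typedef struct VhostIOVATreeTwo {
        uint64_t iova_first, iova_last;
        IOVATree *iova_allocator;   /* as above */
        IOVATree *hva_iova_map;     /* HVA -> SVQ IOVA for guest memory, SVQ vrings
                                       and CVQ buffers alike */
    } VhostIOVATreeTwo;

Which layout wins turns on the open question above: whether the single HVA -> SVQ IOVA tree can cope with guest regions whose HVAs alias each other.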