On Mon, Jul 29, 2024 at 7:50 PM Jonah Palmer <jonah.pal...@oracle.com> wrote:
>
> On 7/29/24 6:04 AM, Eugenio Perez Martin wrote:
> > On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer <jonah.pal...@oracle.com> wrote:
> >>
> >> On 5/13/24 11:56 PM, Jason Wang wrote:
> >>> On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>>>
> >>>> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>
> >>>>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>>>>>
> >>>>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>>>>>>>
> >>>>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <epere...@redhat.com> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where different GPAs lead to the same HVA. This causes a problem when overlapped regions (different GPA but same translated HVA) exist in the tree, as looking them up by HVA will return them twice.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I think I don't understand if there's any side effect for shadow virtqueue?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this comes from.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Si-Wei found that during initialization this sequence of maps / unmaps happens [1]:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> HVA                                 GPA                          IOVA
> >>>>>>>>>>>>>>>> ------------------------------------------------------------------------------------------------
> >>>>>>>>>>>>>>>> Map
> >>>>>>>>>>>>>>>> [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000)            [0x1000, 0x80000000)
> >>>>>>>>>>>>>>>> [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)  [0x80001000, 0x2000001000)
> >>>>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)     [0x2000001000, 0x2000021000)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Unmap
> >>>>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)     [0x1000, 0x20000) ???
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The third HVA range is contained in the first one, but exposed under a different GVA (aliased). This is not "flattened" by QEMU, as GPA does not overlap, only HVA.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> At the third chunk unmap, the current algorithm finds the first chunk, not the second one. This series is the way to tell the difference at unmap time.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in the iova tree to solve this issue completely. Then there won't be aliasing issues.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm ok to explore that route but this has another problem. Both SVQ vrings and CVQ buffers also need to be addressable by VhostIOVATree, and they do not have GPA.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> At this moment vhost_svq_translate_addr is able to handle this transparently as we translate vaddr to SVQ IOVA. How can we store these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and then a list to go through other entries (SVQ vaddr and CVQ buffers).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This seems to be tricky.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As discussed, it could be another iova tree.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes but there are many ways to add another IOVATree. Let me expand & recap.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Option 1 is to simply add another iova tree to VhostShadowVirtqueue. Let's call it gpa_iova_tree, as opposed to the current iova_tree that translates from vaddr to SVQ IOVA. To know which one to use is easy at adding or removing, like in the memory listener, but how to know at vhost_svq_translate_addr?
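For illustration only, here is a minimal sketch of the lookup ambiguity described in the trace above, written against QEMU's util/iova-tree.h helpers (DMAMap, iova_tree_insert(), iova_tree_find_iova()). The function name and the exact sizes are assumptions taken from the trace, not code from the series:

    /*
     * Sketch only: two entries whose IOVA ranges do not overlap, but whose
     * HVA ranges do (the third map of the trace is contained in the first).
     * iova_tree_insert() only checks IOVA overlaps, so both inserts succeed.
     */
    #include "qemu/osdep.h"
    #include "qemu/iova-tree.h"

    static void aliasing_example(void)
    {
        IOVATree *map = iova_tree_new();
        const DMAMap first = {
            .iova = 0x1000,
            .translated_addr = 0x7f7903e00000, /* HVA of GPA [0x0, 0x80000000) */
            .size = 0x80000000 - 1,            /* DMAMap sizes are inclusive */
            .perm = IOMMU_RW,
        };
        const DMAMap third = {
            .iova = 0x2000001000,
            .translated_addr = 0x7f7903ea0000, /* HVA of GPA [0xfeda0000, 0xfedc0000) */
            .size = 0x20000 - 1,
            .perm = IOMMU_RW,
        };
        iova_tree_insert(map, &first);
        iova_tree_insert(map, &third);

        /* Reverse (HVA -> IOVA) lookup, as done at unmap time */
        const DMAMap needle = {
            .translated_addr = 0x7f7903ea0000,
            .size = 0x20000 - 1,
        };
        const DMAMap *found = iova_tree_find_iova(map, &needle);
        /*
         * Both entries overlap the needle's HVA range, so 'found' can be
         * 'first', and the unmap is then issued against the wrong IOVA
         * (the [0x1000, ...) range instead of [0x2000001000, ...)).
         */
        (void)found;
        iova_tree_destroy(map);
    }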
> >>>>>>>>>>>
> >>>>>>>>>>> Then we won't use virtqueue_pop() at all, we need a SVQ version of virtqueue_pop() to translate GPA to SVQ IOVA directly?
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The problem is not virtqueue_pop, that's out of the vhost_svq_translate_addr. The problem is the need to add conditionals / complexity in all the callers of
> >>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> The easiest way for me is to rely on memory_region_from_host(). When vaddr is from the guest, it returns a valid MemoryRegion. When it is not, it returns NULL. I'm not sure if this is a valid use case, it just worked in my tests so far.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Now we have the second problem: the GPA values of the regions of the two IOVA trees must be unique. We need to be able to find unallocated regions in SVQ IOVA. At this moment there is only one IOVATree, so this is done easily by vhost_iova_tree_map_alloc. But it is very complicated with two trees.
> >>>>>>>>>>>
> >>>>>>>>>>> Would it be simpler if we decouple the IOVA allocator? For example, we can have a dedicated gtree to track the allocated IOVA ranges. It is shared by both
> >>>>>>>>>>>
> >>>>>>>>>>> 1) Guest memory (GPA)
> >>>>>>>>>>> 2) SVQ virtqueue and buffers
> >>>>>>>>>>>
> >>>>>>>>>>> And another gtree to track the GPA to IOVA.
> >>>>>>>>>>>
> >>>>>>>>>>> The SVQ code could use either
> >>>>>>>>>>>
> >>>>>>>>>>> 1) one linear mapping that contains both SVQ virtqueue and buffers
> >>>>>>>>>>>
> >>>>>>>>>>> or
> >>>>>>>>>>>
> >>>>>>>>>>> 2) dynamic IOVA allocation/deallocation helpers
> >>>>>>>>>>>
> >>>>>>>>>>> So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> That's possible, but that scatters the IOVA handling code instead of keeping it self-contained in VhostIOVATree.
> >>>>>>>>>
> >>>>>>>>> To me, the IOVA range/allocation is orthogonal to how IOVA is used.
> >>>>>>>>>
> >>>>>>>>> An example is the iova allocator in the kernel.
> >>>>>>>>>
> >>>>>>>>> Note that there's an even simpler IOVA "allocator" in NVME passthrough code, not sure it is useful here though (haven't had a deep look at that).
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I don't know enough about them to have an opinion. I keep seeing the drawback of needing to synchronize both allocation & adding in all the places we want to modify the IOVATree. At this moment, these are the vhost-vdpa memory listener, the SVQ vring creation and removal, and net CVQ buffers. But it may be more in the future.
> >>>>>>>>
> >>>>>>>> What are the advantages of keeping these separated that justify needing to synchronize in all these places, compared with keeping them synchronized in VhostIOVATree?
> >>>>>>>
> >>>>>>> It doesn't need to be synchronized.
> >>>>>>>
> >>>>>>> Assuming guest and SVQ share the IOVA range, IOVA only needs to track which part of the range has been used.
> >>>>>>>
> >>>>>>
> >>>>>> Not sure if I follow, that is what I mean with "synchronized".
> >>>>>
> >>>>> Oh right.
> >>>>>
> >>>>>>
> >>>>>>> This simplifies things, we can store GPA->IOVA mappings and SVQ -> IOVA mappings separately.
> >>>>>>>
> >>>>>>
> >>>>>> Sorry, I still cannot see the whole picture :).
> >>>>>>
> >>>>>> Let's assume we have all the GPA mapped to specific IOVA regions, so we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ because of the migration. How can we know where we can place SVQ vrings without having them synchronized?
> >>>>>
> >>>>> Just allocating a new IOVA range for SVQ?
> >>>>>
> >>>>>>
> >>>>>> At this moment we're using a tree. The tree nature of the current SVQ IOVA -> VA makes all nodes ordered so it is more or less easy to look for holes.
> >>>>>
> >>>>> Yes, iova allocation could still be implemented via a tree.
> >>>>>
> >>>>>>
> >>>>>> Now your proposal uses the SVQ IOVA as tree values. Should we iterate over all of them in the two trees, order them, and then look for holes there?
> >>>>>
> >>>>> Let me clarify, correct me if I was wrong:
> >>>>>
> >>>>> 1) IOVA allocator is still implemented via a tree, we just don't need to store how the IOVA is used
> >>>>> 2) A dedicated GPA -> IOVA tree, updated via listeners and used in the datapath SVQ translation
> >>>>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ
> >>>>>
> >>>>
> >>>> Ok, so the part I was missing is that now we have 3 whole trees, with somehow redundant information :).
> >>>
> >>> Somehow, it decouples the IOVA usage out of the IOVA allocator. This might be simple as guests and SVQ may try to share a single IOVA address space.
> >>>
> >>
> >> I'm working on implementing your three suggestions above but I'm a bit confused with some of the wording and I was hoping you could clarify some of it for me when you get the chance.
> >>
> >> ---
> >> For your first suggestion (1) you mention decoupling the IOVA allocator and "don't need to store how the IOVA is used."
> >>
> >> By this, do you mean to not save the IOVA->HVA mapping and instead only save the allocated IOVA ranges? In other words, are you suggesting to create a dedicated "IOVA->IOVA" tree like:
> >>
> >> struct VhostIOVATree {
> >>     uint64_t iova_first;
> >>     uint64_t iova_last;
> >>     IOVATree *iova_map;
> >> };
> >>
> >> Where the mapping might look something like (where translated_addr is given some kind of 0 value):
> >>
> >> iova_region = (DMAMap) {
> >>     .iova = iova_first,
> >>     .translated_addr = 0,
> >>     .size = region_size - 1,
> >>     .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >> };
> >>
> >> Also, if this is what you mean by decoupling the IOVA allocator, what happens to the IOVA->HVA tree? Are we no longer saving these mappings in a tree?
> >>
> >> ---
> >> In your second suggestion (2) with a dedicated GPA->IOVA tree, were you thinking something like this? Just adding on to VhostIOVATree here:
> >>
> >> struct VhostIOVATree {
> >>     uint64_t iova_first;
> >>     uint64_t iova_last;
> >>     IOVATree *iova_map;
> >>     IOVATree *gpa_map;
> >> };
> >>
> >> Where the mapping might look something like:
> >>
> >> gpa_region = (DMAMap) {
> >>     .iova = iova_first,
> >>     .translated_addr = gpa_first,
> >>     .size = region_size - 1,
> >>     .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >> };
> >>
> >> Also, when you say "used in the datapath SVQ translation", we still need to build the GPA->IOVA tree when vhost_vdpa_listener_region_add() is called, right?
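For what it's worth, one possible reading of suggestion (1) above in code form: an allocation-only tree where translated_addr is never consulted. This is a sketch assuming util/iova-tree.h's iova_tree_alloc_map() and its IOVA_OK return code; the helper name and parameters are illustrative, not part of any posted patch:

    #include "qemu/osdep.h"
    #include "qemu/iova-tree.h"

    /*
     * Sketch: the tree only records which IOVA ranges are in use so that
     * iova_tree_alloc_map() can find a free hole; translated_addr is never
     * read back, so this tree does not translate anything by itself.
     */
    static int iova_only_alloc(IOVATree *iova_map, hwaddr len,
                               hwaddr iova_first, hwaddr iova_last,
                               hwaddr *iova_out)
    {
        DMAMap alloc = {
            .translated_addr = 0,   /* unused by this tree */
            .size = len - 1,        /* DMAMap sizes are inclusive */
            .perm = IOMMU_RW,
        };
        int r = iova_tree_alloc_map(iova_map, &alloc, iova_first, iova_last);

        if (r != IOVA_OK) {
            return r;
        }
        *iova_out = alloc.iova;     /* the range picked by the allocator */
        return IOVA_OK;
    }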
> >>
> >> ---
> >> Lastly, in your third suggestion (3) you mention implementing a SVQ->IOVA tree, making the total number of IOVATrees/GTrees 3: one for just IOVAs, one for GPA->IOVA, and the last one for SVQ->IOVA. E.g.
> >>
> >> struct VhostIOVATree {
> >>     uint64_t iova_first;
> >>     uint64_t iova_last;
> >>     IOVATree *iova_map;
> >>     IOVATree *gpa_map;
> >>     IOVATree *svq_map;
> >> };
> >>
> >> ---
> >>
> >> Let me know if I'm understanding this correctly. If I am, this would require a pretty good amount of refactoring on the IOVA allocation, searching, destroying, etc. code to account for these new trees.
> >>
> >
> > Ok, I think I understand your previous question better, sorry for the delay :).
> >
> > If I'm not wrong, Jason did not enumerate these as alternatives but as part of the same solution. Jason, please correct me if I'm wrong here.
> >
> > His solution is composed of three trees:
> > 1) One for the IOVA allocations, so we know where to allocate new ranges
> > 2) One for the GPA -> SVQ IOVA translations.
> > 3) Another one for SVQ vring translations.
> >
> > In my opinion using GPA is problematic, as we force all of the callers to know if the address we want to translate comes from the guest or not. Current code does not know it, as vhost_svq_translate_addr is called to translate both dataplane buffer addresses and control virtqueue buffers, which are also shadowed. Transmitting that information from the caller code would require either duplicated functions, an extra boolean, etc., as it lives in a different layer (net-specific net/vhost-vdpa.c vs generic hw/virtio/vhost-shadow-virtqueue.c).
> >
> > In my opinion it is easier to keep just two trees (or similar structs):
> > 1) One for the IOVA allocations, so we know where to allocate new ranges. We don't actually need to store the translations here.
> > 2) One for HVA -> SVQ IOVA translations.
> >
> > This way we can accommodate SVQ vrings, CVQ buffers, and guest memory buffers, all in the second tree, and take care of the HVA duplications.
> >
> > Thanks!
>
> I assume that this dedicated IOVA tree will be created and added to in the vhost_iova_tree_map_alloc --> iova_tree_alloc_map function calls, but what about the IOVA->HVA tree that gets created during vhost_vdpa_listener_region_add?
Not sure if I get you. The only IOVA tree that vdpa is using is created at either net/vhost-vdpa.c:vhost_vdpa_net_data_start_first or vhost_vdpa_net_cvq_start. From that moment on, other places like vhost_vdpa_listener_region_add just add entries to it.

> Will this tree just be replaced with the dedicated IOVA tree?
>

I'd say that for a first iteration it is ok to keep using VhostIOVATree->IOVATree, even if the values of the tree (HVA) are not used anymore.

> Also, an HVA->SVQ IOVA tree means that the tree is balanced using the HVA value and not the IOVA value, right? In other words, a tree where it's more efficient to search using the HVA values vs. IOVA values?
>

Right, HVA is the key and SVQ IOVA is the value. That way we can detect duplicates and act accordingly.

> >
> >> Thanks Jason!
> >>
> >>>>
> >>>> In some sense this is simpler than trying to get all the information from only two trees. On the bad side, all SVQ calls that allocate some region need to remember to add to one of the two other trees. That is what I mean by synchronized. But sure, we can go that way.
> >>>
> >>> Just a suggestion, if it turns out to complicate the issue, I'm fine to go the other way.
> >>>
> >>> Thanks
> >>>
> >>>>
> >>>>> Thanks
> >>>>>
> >>>>>>
> >>>>>>> Thanks
> >>>>>>>
> >>>>>>>> Thanks!
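To make the two-structure idea concrete, here is a rough sketch (not the actual implementation, with illustrative names such as taddr_map and vhost_iova_tree_map_alloc_sketch) of how VhostIOVATree could look, reusing util/iova-tree.h for both parts. The duplicate handling shown, reusing the existing SVQ IOVA for an already-mapped HVA range, is just one possible way to "act accordingly"; a real version would likely key the second structure by HVA for cheaper lookups:

    #include "qemu/osdep.h"
    #include "qemu/iova-tree.h"

    typedef struct VhostIOVATree {
        hwaddr iova_first;
        hwaddr iova_last;
        IOVATree *iova_map;  /* allocator only: which SVQ IOVA ranges are used */
        IOVATree *taddr_map; /* HVA -> SVQ IOVA: guest memory, SVQ vrings, CVQ buffers */
    } VhostIOVATree;

    /* Caller fills map->translated_addr (HVA), map->size and map->perm */
    static int vhost_iova_tree_map_alloc_sketch(VhostIOVATree *tree, DMAMap *map)
    {
        /*
         * HVA duplicates (aliased regions): reuse the SVQ IOVA that already
         * covers this HVA range.  Partial overlaps are not handled here.
         */
        const DMAMap *dup = iova_tree_find_iova(tree->taddr_map, map);
        if (dup) {
            map->iova = dup->iova + (map->translated_addr - dup->translated_addr);
            return IOVA_OK;
        }

        /* Allocate a free SVQ IOVA range ... */
        int r = iova_tree_alloc_map(tree->iova_map, map, tree->iova_first,
                                    tree->iova_last);
        if (r != IOVA_OK) {
            return r;
        }

        /* ... and record the HVA -> SVQ IOVA translation for the datapath */
        return iova_tree_insert(tree->taddr_map, map);
    }

Unmapping, permission checks and partially overlapping HVA ranges are deliberately left out of this sketch.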