On Mon, Jul 29, 2024 at 6:05 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
>
> On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer <jonah.pal...@oracle.com> wrote:
> >
> > On 5/13/24 11:56 PM, Jason Wang wrote:
> > > On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin
> > > <epere...@redhat.com> wrote:
> > >>
> > >> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasow...@redhat.com> wrote:
> > >>>
> > >>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
> > >>> <epere...@redhat.com> wrote:
> > >>>>
> > >>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>
> > >>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> > >>>>> <epere...@redhat.com> wrote:
> > >>>>>>
> > >>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>
> > >>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin
> > >>>>>>> <epere...@redhat.com> wrote:
> > >>>>>>>>
> > >>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin
> > >>>>>>>>> <epere...@redhat.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin
> > >>>>>>>>>>> <epere...@redhat.com> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > >>>>>>>>>>>>> <epere...@redhat.com> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <epere...@redhat.com> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where different GPA leads
> > >>>>>>>>>>>>>>>> to the same HVA. This causes a problem when overlapped regions
> > >>>>>>>>>>>>>>>> (different GPA but same translated HVA) exist in the tree, as looking
> > >>>>>>>>>>>>>>>> them up by HVA will return them twice.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I think I don't understand if there's any side effect for
> > >>>>>>>>>>>>>>> shadow virtqueue?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this comes from.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Si-Wei found that during initialization this sequence of maps /
> > >>>>>>>>>>>>>> unmaps happens [1]:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> HVA                                  GPA                          IOVA
> > >>>>>>>>>>>>>> -------------------------------------------------------------------------------------------------------------------------
> > >>>>>>>>>>>>>> Map
> > >>>>>>>>>>>>>> [0x7f7903e00000, 0x7f7983e00000)     [0x0, 0x80000000)            [0x1000, 0x80000000)
> > >>>>>>>>>>>>>> [0x7f7983e00000, 0x7f9903e00000)     [0x100000000, 0x2080000000)  [0x80001000, 0x2000001000)
> > >>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)     [0x2000001000, 0x2000021000)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Unmap
> > >>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)     [0x1000, 0x20000) ???
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The third HVA range is contained in the first one, but exposed under a
> > >>>>>>>>>>>>>> different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > >>>>>>>>>>>>>> not overlap, only HVA.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> At the third chunk unmap, the current algorithm finds the first chunk,
> > >>>>>>>>>>>>>> not the second one. This series is the way to tell the difference at
> > >>>>>>>>>>>>>> unmap time.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> [1] https://urldefense.com/v3/__https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html__;!!ACWV5N9M2RV99hQ!MXbGSFHVbqRf0rzyWamOdnBLHP0FUh3r3BnTvGe6Mn5VzXTsajVp3BB7VqlklkRCr5aKazC5xxTCScuR_BY$
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Ok, I was wondering if we need to store GPA (GIOVA) to HVA mappings in
> > >>>>>>>>>>>>> the iova tree to solve this issue completely. Then there won't be
> > >>>>>>>>>>>>> aliasing issues.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'm ok to explore that route but this has another problem. Both SVQ
> > >>>>>>>>>>>> vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > >>>>>>>>>>>> and they do not have a GPA.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> At this moment vhost_svq_translate_addr is able to handle this
> > >>>>>>>>>>>> transparently as we translate vaddr to SVQ IOVA. How can we store
> > >>>>>>>>>>>> these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > >>>>>>>>>>>> then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > >>>>>>>>>>>
> > >>>>>>>>>>> This seems to be tricky.
> > >>>>>>>>>>>
> > >>>>>>>>>>> As discussed, it could be another iova tree.
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Yes, but there are many ways to add another IOVATree. Let me expand & recap.
> > >>>>>>>>>>
> > >>>>>>>>>> Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > >>>>>>>>>> Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > >>>>>>>>>> translates from vaddr to SVQ IOVA. Knowing which one to use is easy at
> > >>>>>>>>>> adding or removing time, like in the memory listener, but how do we
> > >>>>>>>>>> know at vhost_svq_translate_addr?
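To make the aliasing above concrete, here is a minimal, self-contained C sketch. It is a toy model, not QEMU's IOVATree code; the ranges come from Si-Wei's trace, with sizes approximated by the HVA spans. It shows why a lookup keyed on HVA alone hits the first RAM region instead of the aliased one, while matching on HVA plus GPA picks the entry the unmap actually refers to:

    /*
     * Toy illustration of the aliasing problem, not QEMU code. The three
     * "Map" entries come from the trace above; sizes follow the HVA spans.
     */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t hva, gpa, iova, size;
    } ToyMap;

    static const ToyMap maps[] = {
        { 0x7f7903e00000ULL, 0x000000000ULL, 0x0000001000ULL, 0x0080000000ULL },
        { 0x7f7983e00000ULL, 0x100000000ULL, 0x0080001000ULL, 0x1f80000000ULL },
        { 0x7f7903ea0000ULL, 0x0feda0000ULL, 0x2000001000ULL, 0x0000020000ULL },
    };

    /* Current behaviour: the first entry whose HVA range covers the address wins. */
    static const ToyMap *find_by_hva(uint64_t hva)
    {
        for (size_t i = 0; i < sizeof(maps) / sizeof(maps[0]); i++) {
            if (hva >= maps[i].hva && hva < maps[i].hva + maps[i].size) {
                return &maps[i];
            }
        }
        return NULL;
    }

    /* What the unmap path needs: use the GPA to disambiguate aliased HVAs. */
    static const ToyMap *find_by_hva_gpa(uint64_t hva, uint64_t gpa)
    {
        for (size_t i = 0; i < sizeof(maps) / sizeof(maps[0]); i++) {
            if (hva >= maps[i].hva && hva < maps[i].hva + maps[i].size &&
                gpa >= maps[i].gpa && gpa < maps[i].gpa + maps[i].size) {
                return &maps[i];
            }
        }
        return NULL;
    }

    int main(void)
    {
        uint64_t hva = 0x7f7903ea0000ULL;  /* start of the aliased third chunk */
        uint64_t gpa = 0xfeda0000ULL;

        /* Prints IOVA 0x1000: the big RAM region, i.e. the wrong entry. */
        printf("by HVA only:    IOVA 0x%" PRIx64 "\n", find_by_hva(hva)->iova);
        /* Prints IOVA 0x2000001000: the aliased region the unmap refers to. */
        printf("by HVA and GPA: IOVA 0x%" PRIx64 "\n", find_by_hva_gpa(hva, gpa)->iova);
        return 0;
    }

This is the ambiguity the series needs to resolve at unmap time.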
> > >>>>>>>>>
> > >>>>>>>>> Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > >>>>>>>>> virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> The problem is not virtqueue_pop, that's outside of
> > >>>>>>>> vhost_svq_translate_addr. The problem is the need to add
> > >>>>>>>> conditionals / complexity in all the callers of
> > >>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> The easiest way for me is to rely on memory_region_from_host(). When
> > >>>>>>>>>> vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > >>>>>>>>>> not, it returns NULL. I'm not sure if this is a valid use case, it
> > >>>>>>>>>> just worked in my tests so far.
> > >>>>>>>>>>
> > >>>>>>>>>> Now we have the second problem: the GPA values of the regions of the
> > >>>>>>>>>> two IOVA trees must be unique. We need to be able to find unallocated
> > >>>>>>>>>> regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > >>>>>>>>>> this is done easily by vhost_iova_tree_map_alloc. But it is very
> > >>>>>>>>>> complicated with two trees.
> > >>>>>>>>>
> > >>>>>>>>> Would it be simpler if we decouple the IOVA allocator? For example, we
> > >>>>>>>>> can have a dedicated gtree to track the allocated IOVA ranges. It is
> > >>>>>>>>> shared by both
> > >>>>>>>>>
> > >>>>>>>>> 1) Guest memory (GPA)
> > >>>>>>>>> 2) SVQ virtqueue and buffers
> > >>>>>>>>>
> > >>>>>>>>> And another gtree to track the GPA to IOVA.
> > >>>>>>>>>
> > >>>>>>>>> The SVQ code could use either
> > >>>>>>>>>
> > >>>>>>>>> 1) one linear mapping that contains both SVQ virtqueue and buffers
> > >>>>>>>>>
> > >>>>>>>>> or
> > >>>>>>>>>
> > >>>>>>>>> 2) dynamic IOVA allocation/deallocation helpers
> > >>>>>>>>>
> > >>>>>>>>> So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> That's possible, but that scatters the IOVA handling code instead of
> > >>>>>>>> keeping it self-contained in VhostIOVATree.
> > >>>>>>>
> > >>>>>>> To me, the IOVA range/allocation is orthogonal to how the IOVA is used.
> > >>>>>>>
> > >>>>>>> An example is the iova allocator in the kernel.
> > >>>>>>>
> > >>>>>>> Note that there's an even simpler IOVA "allocator" in the NVMe
> > >>>>>>> passthrough code, not sure it is useful here though (haven't had a
> > >>>>>>> deep look at that).
> > >>>>>>>
> > >>>>>>
> > >>>>>> I don't know enough about them to have an opinion. I keep seeing the
> > >>>>>> drawback of needing to synchronize both allocation & adding in all the
> > >>>>>> places we want to modify the IOVATree. At this moment, these are the
> > >>>>>> vhost-vdpa memory listener, the SVQ vring creation and removal, and
> > >>>>>> net CVQ buffers. But it may be more in the future.
> > >>>>>>
> > >>>>>> What are the advantages of keeping these separated that justify
> > >>>>>> needing to synchronize in all these places, compared with keeping them
> > >>>>>> synchronized in VhostIOVATree?
> > >>>>>
> > >>>>> It doesn't need to be synchronized.
> > >>>>>
> > >>>>> Assuming guest and SVQ share the IOVA range, the IOVA allocator only
> > >>>>> needs to track which part of the range has been used.
> > >>>>>
> > >>>>
> > >>>> Not sure if I follow, that is what I mean by "synchronized".
> > >>>
> > >>> Oh right.
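As a concrete reading of "only needs to track which part of the range has been used", here is a minimal sketch under that assumption. It is hypothetical, plain C rather than QEMU's GTree-backed IOVATree, with invented names: the allocator only records which ranges are in use and hands out holes, without knowing whether a range backs guest memory or SVQ vrings.

    /*
     * Hypothetical sketch, not QEMU code: an IOVA allocator that only
     * remembers which [iova, iova + size) ranges are in use and can find a
     * free hole, knowing nothing about what each range maps to.
     */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct UsedRange {
        uint64_t iova;
        uint64_t size;
        struct UsedRange *next;        /* list kept sorted by iova */
    } UsedRange;

    typedef struct {
        uint64_t first, last;          /* usable IOVA range, inclusive */
        UsedRange *used;
    } IovaAllocator;

    /* First fit: find a hole of @size, record it as used, return its start. */
    static int iova_alloc(IovaAllocator *a, uint64_t size, uint64_t *out)
    {
        uint64_t hole_start = a->first;
        UsedRange **prev = &a->used;

        for (UsedRange *r = a->used; ; prev = &r->next, r = r->next) {
            uint64_t hole_end = r ? r->iova : a->last + 1;

            if (hole_end - hole_start >= size) {
                UsedRange *n = malloc(sizeof(*n));
                n->iova = hole_start;
                n->size = size;
                n->next = r;
                *prev = n;
                *out = hole_start;
                return 0;
            }
            if (!r) {
                return -1;             /* no hole big enough */
            }
            hole_start = r->iova + r->size;
        }
    }

    int main(void)
    {
        IovaAllocator a = { .first = 0x1000, .last = 0x7fffffffffULL, .used = NULL };
        uint64_t guest_iova, svq_iova;

        /* Guest memory and the SVQ vring draw from the same allocator... */
        iova_alloc(&a, 0x80000000ULL, &guest_iova);
        iova_alloc(&a, 0x4000, &svq_iova);
        /* ...so their ranges never collide: 0x1000 and 0x80001000 here. */
        printf("guest @ 0x%" PRIx64 ", svq vring @ 0x%" PRIx64 "\n",
               guest_iova, svq_iova);
        return 0;
    }

How each allocated range is used (GPA -> IOVA, HVA -> IOVA) would then live in separate trees, which is the decoupling being discussed.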
> > >>>
> > >>>>
> > >>>>> This simplifies things, we can store GPA -> IOVA mappings and SVQ ->
> > >>>>> IOVA mappings separately.
> > >>>>>
> > >>>>
> > >>>> Sorry, I still cannot see the whole picture :).
> > >>>>
> > >>>> Let's assume we have all the GPA mapped to specific IOVA regions, so
> > >>>> we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ
> > >>>> because of the migration. How can we know where we can place the SVQ
> > >>>> vrings without having them synchronized?
> > >>>
> > >>> Just allocating a new IOVA range for SVQ?
> > >>>
> > >>>>
> > >>>> At this moment we're using a tree. The tree nature of the current SVQ
> > >>>> IOVA -> VA makes all nodes ordered, so it is more or less easy to look
> > >>>> for holes.
> > >>>
> > >>> Yes, IOVA allocation could still be implemented via a tree.
> > >>>
> > >>>>
> > >>>> Now your proposal uses the SVQ IOVA as tree values. Should we iterate
> > >>>> over all the nodes of the two trees, order them, and then look for
> > >>>> holes there?
> > >>>
> > >>> Let me clarify, correct me if I'm wrong:
> > >>>
> > >>> 1) The IOVA allocator is still implemented via a tree, we just don't
> > >>> need to store how the IOVA is used
> > >>> 2) A dedicated GPA -> IOVA tree, updated via listeners and used in
> > >>> the datapath SVQ translation
> > >>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ
> > >>>
> > >>
> > >> Ok, so the part I was missing is that now we have 3 whole trees, with
> > >> somewhat redundant information :).
> > >
> > > Somehow, it decouples the IOVA usage from the IOVA allocator. This
> > > might be simple, as guests and SVQ may try to share a single IOVA
> > > address space.
> > >
> >
> > I'm working on implementing your three suggestions above but I'm a bit
> > confused by some of the wording and I was hoping you could clarify
> > some of it for me when you get the chance.
> >
> > ---
> > For your first suggestion (1) you mention decoupling the IOVA allocator
> > and "don't need to store how the IOVA is used."
> >
> > By this, do you mean to not save the IOVA->HVA mapping and instead only
> > save the allocated IOVA ranges?
Yes.

> > In other words, are you suggesting to
> > create a dedicated "IOVA->IOVA" tree like:
> >
> >     struct VhostIOVATree {
> >         uint64_t iova_first;
> >         uint64_t iova_last;
> >         IOVATree *iova_map;
> >     };

It could be this or something else. I think the point is that the allocator is
only used for IOVA allocation but it doesn't have any information about the
mapping.

> >
> > Where the mapping might look something like (where translated_addr is
> > given some kind of 0 value):
> >
> >     iova_region = (DMAMap) {
> >         .iova = iova_first,
> >         .translated_addr = 0,
> >         .size = region_size - 1,
> >         .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >     };
> >
> > Also, if this is what you mean by decoupling the IOVA allocator, what
> > happens to the IOVA->HVA tree?

We will have two structures:

1) The IOVA domain (in charge of IOVA range allocation/deallocation)
2) How the IOVA is used, e.g. an IOVA->HVA tree

> > Are we no longer saving these mappings in
> > a tree?
> >
> > ---
> > In your second suggestion (2) with a dedicated GPA->IOVA tree, were you
> > thinking something like this? Just adding on to VhostIOVATree here:
> >
> >     struct VhostIOVATree {
> >         uint64_t iova_first;
> >         uint64_t iova_last;
> >         IOVATree *iova_map;
> >         IOVATree *gpa_map;
> >     };

I'd split the mappings if it's possible.

> >
> > Where the mapping might look something like:
> >
> >     gpa_region = (DMAMap) {
> >         .iova = iova_first,
> >         .translated_addr = gpa_first,
> >         .size = region_size - 1,
> >         .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >     };
> >
> > Also, when you say "used in the datapath SVQ translation", we still need
> > to build the GPA->IOVA tree when vhost_vdpa_listener_region_add() is
> > called, right?

Right.

> >
> > ---
> > Lastly, in your third suggestion (3) you mention implementing an
> > SVQ->IOVA tree, making the total number of IOVATrees/GTrees 3: one for
> > just IOVAs, one for GPA->IOVA, and the last one for SVQ->IOVA. E.g.
> >
> >     struct VhostIOVATree {
> >         uint64_t iova_first;
> >         uint64_t iova_last;
> >         IOVATree *iova_map;
> >         IOVATree *gpa_map;
> >         IOVATree *svq_map;
> >     };
> >
> > ---
> >
> > Let me know if I'm understanding this correctly. If I am, this would
> > require a pretty good amount of refactoring of the IOVA allocation,
> > searching, destroying, etc. code to account for these new trees.
>
> Ok, I think I understand your previous question better now, sorry for the delay :).
>
> If I'm not wrong, Jason did not enumerate these as alternatives but as
> parts of the same solution. Jason, please correct me if I'm wrong here.
>
> His solution is composed of three trees:
> 1) One for the IOVA allocations, so we know where to allocate new ranges
> 2) One for the GPA -> SVQ IOVA translations.
> 3) Another one for the SVQ vring translations.
>

Exactly.

> In my opinion using GPA is problematic, as we force all of the
> callers to know whether the address we want to translate comes from the
> guest or not. Current code does not know it, as
> vhost_svq_translate_addr is called to translate both dataplane buffer
> addresses and control virtqueue buffers, which are also shadowed.
> Transmitting that information to the caller code would require either
> duplicating functions, adding a boolean, etc., as it is in a different
> layer (net-specific net/vhost-vdpa.c vs generic
> hw/virtio/vhost-shadow-virtqueue.c).
>
> In my opinion it is easier to keep just two trees (or similar structs):
> 1) One for the IOVA allocations, so we know where to allocate new
> ranges. We don't actually need to store the translations here.
> 2) One for HVA -> SVQ IOVA translations.
>
> This way we can accommodate both SVQ vrings, CVQ buffers, and guest
> memory buffers, all on the second tree, and take care of the HVA
> duplications.

Probably, but how to take care of HVA duplications (I may miss the
context, sorry)?

Thanks

>
> Thanks!
>
> > Thanks Jason!
> >
> > >>
> > >> In some sense this is simpler than trying to get all the information
> > >> from only two trees. On the bad side, all SVQ calls that allocate some
> > >> region need to remember to add to one of the two other trees. That is
> > >> what I mean by synchronized. But sure, we can go that way.
> > >
> > > Just a suggestion, if it turns out to complicate the issue, I'm fine
> > > to go the other way.
> > >
> > > Thanks
> > >
> > >>
> > >>> Thanks
> > >>>
> > >>>>
> > >>>>> Thanks
> > >>>>>
> > >>>>>>
> > >>>>>> Thanks!
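For reference, a rough C sketch of the two layouts weighed in this thread. This is illustration only, not the actual QEMU structures: IOVATree is left opaque and the field names are invented.

    #include <stdint.h>

    typedef struct IOVATree IOVATree;   /* QEMU's GTree-backed map, opaque here */

    /* Jason's layout: a decoupled allocator plus two usage trees. */
    typedef struct VhostIOVATreeThree {
        uint64_t iova_first, iova_last;
        IOVATree *iova_allocator;   /* which IOVA ranges are taken; no targets */
        IOVATree *gpa_iova_map;     /* GPA -> SVQ IOVA, fed by the memory listener */
        IOVATree *svq_iova_map;     /* HVA -> SVQ IOVA for SVQ vrings / CVQ buffers */
    } VhostIOVATreeThree;

    /* Eugenio's preference: the allocator plus a single HVA -> SVQ IOVA tree. */
    typedef struct VhostIOVATreeTwo {
        uint64_t iova_first, iova_last;
        IOVATree *iova_allocator;   /* as above */
        IOVATree *hva_iova_map;     /* HVA -> SVQ IOVA for guest memory, SVQ vrings
                                       and CVQ buffers alike */
    } VhostIOVATreeTwo;

Which layout wins turns on the open question above: whether the single HVA -> SVQ IOVA tree can cope with guest regions whose HVAs alias each other.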