On Thu, Oct 10, 2024 at 9:00 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
>
> On 10/9/2024 2:29 AM, Eugenio Perez Martin wrote:
> > On Tue, Oct 8, 2024 at 10:30 PM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>
> >> On 10/8/2024 8:40 AM, Jonah Palmer wrote:
> >>>
> >>> On 10/8/24 2:51 AM, Eugenio Perez Martin wrote:
> >>>> On Tue, Oct 8, 2024 at 2:14 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>>
> >>>>> On 10/7/2024 6:51 AM, Eugenio Perez Martin wrote:
> >>>>>> On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer
> >>>>>> <jonah.pal...@oracle.com> wrote:
> >>>>>>>
> >>>>>>> On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
> >>>>>>>> On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer
> >>>>>>>> <jonah.pal...@oracle.com> wrote:
> >>>>>>>>> Implements the IOVA->GPA tree for handling mapping, unmapping, and
> >>>>>>>>> translations for guest memory regions.
> >>>>>>>>>
> >>>>>>>>> When the guest has overlapping memory regions, an HVA to IOVA
> >>>>>>>>> translation may return an incorrect IOVA when searching the
> >>>>>>>>> IOVA->HVA tree. This is due to one HVA range being contained
> >>>>>>>>> (overlapping) in another HVA range in the IOVA->HVA tree. By
> >>>>>>>>> creating an IOVA->GPA tree, we can use GPAs to translate and find
> >>>>>>>>> the correct IOVA for guest memory regions.
> >>>>>>>>>
> >>>>>>>> Yes, this first patch is super close to what I meant, just one issue
> >>>>>>>> and a pair of nits here and there.
> >>>>>>>>
> >>>>>>>> I'd leave the second patch as an optimization on top, if the numbers
> >>>>>>>> prove that adding the code is worth it.
> >>>>>>>>
> >>>>>>> Ah okay, gotcha. I also wasn't sure if what you mentioned below on the
> >>>>>>> previous series you also wanted implemented or if these would also be
> >>>>>>> optimizations on top.
> >>>>>>>
> >>>>>>> [Adding code to the vhost_iova_tree layer for handling multiple
> >>>>>>> buffers returned from translation for the memory area where each
> >>>>>>> iovec covers]:
> >>>>>>> -----------------------------------------------------------------------
> >>>>>>>
> >>>>>>> "Let's say that SVQ wants to translate the HVA range
> >>>>>>> 0xfeda0000-0xfedd0000. So it makes available for the device two
> >>>>>>> chained buffers: One with addr=0x1000 len=0x20000 and the other one
> >>>>>>> with addr=(0x20000c1000 len=0x10000).
> >>>>>>>
> >>>>>>> The VirtIO device should be able to translate these two buffers in
> >>>>>>> isolation and chain them. Not optimal but it helps to keep QEMU
> >>>>>>> source clean, as the device already must support it."
> >>>>>>>
> >>>>>> This is 100% in the device and QEMU is already able to split the
> >>>>>> buffers that way, so we don't need any change in QEMU.
> >>>>> Noted that if working with the full HVA tree directly, the internal
> >>>>> iova tree linear iterator iova_tree_find_address_iterator() today
> >>>>> doesn't guarantee the IOVA range returned can cover the entire length
> >>>>> of the iov, so it can happen that an aliased range smaller than the
> >>>>> requested size is hit first in the linear search and returned; that
> >>>>> kind of fragmentation can't be guarded against by the VirtIO device
> >>>>> or the DMA API mentioned above.
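To make the failure mode above concrete, here is a minimal standalone sketch (the structures, addresses and sizes are made up for illustration; this is not QEMU's DMAMap or iova-tree code). A linear search with an overlap-only predicate hits the smaller aliased range first and returns a range that covers only part of the requested length:

/*
 * Standalone illustration only: assumed structures and values, not
 * QEMU's iova-tree implementation.
 */
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t iova;
    uint64_t hva;
    uint64_t size;   /* length in bytes */
} Map;

/* Overlap-only predicate, in the spirit of the current iterator check. */
static bool overlaps(const Map *map, uint64_t hva, uint64_t size)
{
    return !(map->hva + map->size <= hva || hva + size <= map->hva);
}

int main(void)
{
    /* Aliased layout: the first entry is a small range contained in the second. */
    const Map tree[] = {
        { .iova = 0x1000,  .hva = 0x7f0000001000, .size = 0x1000 },   /* alias */
        { .iova = 0x10000, .hva = 0x7f0000000000, .size = 0x100000 }, /* backing */
    };
    const uint64_t needle_hva = 0x7f0000001000, needle_size = 0x8000;

    for (unsigned i = 0; i < 2; i++) {
        if (overlaps(&tree[i], needle_hva, needle_size)) {
            uint64_t covered = tree[i].hva + tree[i].size - needle_hva;
            printf("hit map %u: covers 0x%" PRIx64 " of the requested 0x%" PRIx64 "\n",
                   i, covered, needle_size);
            break;  /* first hit wins, even though it is shorter than the iov */
        }
    }
    return 0;
}

Here the first hit covers only 0x1000 bytes of the 0x8000-byte request, which is exactly the fragmentation being described.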
> >>>>>
> >>>>> The second patch in this series kind of mitigated this side effect by
> >>>>> sorting out the backing ram_block with the help of
> >>>>> qemu_ram_block_from_host() in case of guest memory backed iov, so it
> >>>>> doesn't really count on vhost_iova_gpa_tree_find_iova() to find the
> >>>>> matching IOVA, but instead (somehow implicitly) ensures the
> >>>>> fragmentation side effect mentioned above can never happen. Not saying
> >>>>> I like the way it is implemented, but just wanted to point out the
> >>>>> implication if the second patch has to be removed - either add special
> >>>>> handling code to the iova-tree iterator sizing the range (same as how
> >>>>> range selection based upon permission will be done), or add special
> >>>>> code in the SVQ layer to deal with fragmented IOVA ranges due to
> >>>>> aliasing.
> >>>>>
> >>>> This special code in SVQ is already there. And it will be needed even
> >>>> if we look for the buffers by GPA instead of by vaddr, the same way
> >>>> virtqueue_map_desc needs to handle it even if it works with GPA.
> >>>> Continuous GPA does not imply continuous vaddr.
> >>>>
> >>> My apologies if I misunderstood something here regarding size &
> >>> permission matching, but I attempted to add both the size and
> >>> permission checks to iova_tree_find_address_iterator(); however,
> >>> there's a sizing mismatch in the vhost_svq_translate_addr() code path
> >>> that's causing vhost-net to fail to start:
> >>>
> >>> qemu-system-x86_64: unable to start vhost net: 22: falling back on
> >>> userspace virtio
> >>>
> >>> More specifically, in vhost_svq_add_split(), the first call to
> >>> vhost_svq_vring_write_descs() returns false:
> >>>
> >>>     ok = vhost_svq_vring_write_descs(svq, sgs, out_sg, out_num,
> >>>                                      in_num > 0, false);
> >>>     if (unlikely(!ok)) {
> >>>         return false;
> >>>     }
> >>>
> >>> Maybe I misunderstood the proposal, but in
> >>> iova_tree_find_address_iterator I'm checking for an exact match for
> >>> sizes:
> >>>
> >>>     if (map->size != needle->size || map->perm != needle->perm) {
> >>>         return false;
> >>>     }
> >>>
> >> The permission needs to exactly match,
> > Careful with this: read-only buffers can live in the guest's RW memory.
> > I think the only actual condition is that if the buffer is writable,
> > the memory area must be writable. If they don't match, we should keep
> > looking for the right area.
> Indeed.
>
> >> while the size doesn't have to. The
> >> current iova_tree_find_address_iterator() has the following check:
> >>
> >>     if (map->translated_addr + map->size < needle->translated_addr ||
> >>         needle->translated_addr + needle->size < map->translated_addr) {
> >>         return false;
> >>     }
> >>
> >> So essentially it does make sure the starting translated_addr on the
> >> needle is greater than or equal to the starting translated_addr on the
> >> map, and the first match of the map range in an ordered linear search
> >> will be returned, but it doesn't guarantee the ending address on the
> >> needle (needle->translated_addr + needle->size) will be covered by the
> >> ending address on the map (map->translated_addr + map->size), so this
> >> implementation is subject to fragmented IOVA ranges with a contiguous
> >> HVA address.
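A minimal sketch of the match rule being discussed, assuming that permissions only need to be compatible (a writable buffer requires a writable area) and that a candidate map must cover the needle's end rather than match its size exactly. The types and helper names here are invented for illustration; this is not the actual iova_tree_find_address_iterator():

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { PERM_RO, PERM_RW } Perm;

typedef struct {
    uint64_t translated_addr;   /* HVA */
    uint64_t size;              /* length in bytes */
    Perm perm;
} Range;

/* A writable buffer must land in writable memory; a read-only one can use either. */
static bool perm_compatible(Perm needle, Perm map)
{
    return needle == PERM_RO || map == PERM_RW;
}

/* Accept a map only if it fully covers the needle, not merely overlaps it. */
static bool map_covers_needle(const Range *map, const Range *needle)
{
    if (!perm_compatible(needle->perm, map->perm)) {
        return false;               /* keep looking for a suitable area */
    }
    return map->translated_addr <= needle->translated_addr &&
           needle->translated_addr + needle->size <=
           map->translated_addr + map->size;
}

int main(void)
{
    Range map_ro  = { .translated_addr = 0x1000, .size = 0x4000, .perm = PERM_RO };
    Range map_rw  = { .translated_addr = 0x1000, .size = 0x4000, .perm = PERM_RW };
    Range buf_rw  = { .translated_addr = 0x2000, .size = 0x1000, .perm = PERM_RW };
    Range buf_big = { .translated_addr = 0x2000, .size = 0x8000, .perm = PERM_RO };

    assert(!map_covers_needle(&map_ro, &buf_rw));  /* writable buffer, RO area */
    assert(map_covers_needle(&map_rw, &buf_rw));   /* compatible and fully covered */
    assert(!map_covers_needle(&map_rw, &buf_big)); /* overlaps but does not cover */
    return 0;
}

With a rule like this, the search can skip permission-incompatible or too-small entries and keep going, rather than requiring the needle and map sizes to be identical as the exact-match check above does.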
> >> I don't see the current SVQ handling this well, and the reason for IOVA
> >> fragmentation is overlapped regions or memory aliasing (unlike the GPA
> >> tree implementation, we have no additional info to help us identify the
> >> exact IOVA when two or more overlapped HVA ranges are given), which is
> >> orthogonal to the HVA fragmentation (with contiguous GPA) handling in
> >> virtqueue_map_desc().
> >>
> >> How to handle this situation? Well, I guess Eugenio may have a different
> >> opinion, but to me the only way seems to be to continue searching until
> >> a well-covered IOVA range can be found, or we may have to return
> >> multiple IOVA ranges splitting a contiguous HVA buffer...
> >>
> > Not sure if I followed this 100%, but we must be capable of returning
> > multiple IOVA ranges if we trust in SVQ to do the translation anyway.
> >
> > When SVQ starts, the guest memory listener already gives the GPA ->
> > HVA translations fragmented and unordered, as QEMU can hotplug memory
> > for example. Let me give a simple example:
> >
> > GPA [0, 0x1000) -> HVA [0x10000, 0x11000)
> > GPA [0x1000, 0x2000) -> HVA [0x20000, 0x21000)
> >
> > Now we send the IOVA to the device this way:
> > IOVA [0x2000, 0x3000) -> HVA [0x20000, 0x21000)
> > IOVA [0x3000, 0x4000) -> HVA [0x10000, 0x11000)
> >
> > So we have this translation tree (assuming we already store them as
> > GPA->IOVA):
> > GPA [0, 0x1000) -> IOVA [0x3000, 0x4000)
> > GPA [0x1000, 0x2000) -> IOVA [0x2000, 0x3000)
> >
> > Now the guest sends this single buffer in a single descriptor:
> > GPA 0x500, len 0x1000.
> >
> > We need to split it into two descriptors anyway. Without memory
> > aliases we're covered at this moment, as virtqueue_pop already solves
> > these splits for us in virtqueue_map_desc.
> That is what I said above: this case (contiguous GPA backed by fragmented
> HVA ranges) is well covered by the DMA API called by virtqueue_map_desc().
> I was aware this case has been handled in the current SVQ code, as
> reflected by the comments below in vhost_handle_guest_kick(). This is
> definitely clear to me.
>
>     /*
>      * This condition is possible since a contiguous buffer in
>      * GPA does not imply a contiguous buffer in qemu's VA
>      * scatter-gather segments. If that happens, the buffer
>      * exposed to the device needs to be a chain of descriptors
>      * at this moment.
>      *
>      * SVQ cannot hold more available buffers if we are here:
>      * queue the current guest descriptor and ignore kicks
>      * until some elements are used.
>      */
>
> > Now I realize that SVQ may not be able to cover these splits with
> > aliases, as we might need to return more addresses than MAX(out_num,
> > in_num) in vhost_svq_vring_write_descs.
> Right, this is the case (fragmented IOVA ranges with a contiguous HVA
> range due to aliasing) I was referring to, and it is not handled at all
> in the current SVQ or vhost-iova-tree code. This means we have to handle
> a 1:N mapping for each iovec, which essentially introduces unnecessary
> complexity rather than the simple or straightforward change it was
> claimed to be. Sorry for speaking straight, but I don't really see a
> major benefit in performance, extensibility or flexibility from
> following this seemingly interim solution.
>
> On the contrary, instead of simplicity this is becoming more than a
> headache now - Jonah and I have to take extra care not to break any
> other device that doesn't use aliased maps in the same way the mch
> device does, as the latter is the only case Jonah used to test this
> patch series.
> That's why in the morning meeting I asked if more test cases could be
> shared, like how Lei Yang did in his previous testing of your early page
> pinning series (I recall one failing case seemed to be related to hot
> plug), as the early pinning would open up Pandora's box and could
> potentially break a lot of things while the memory layout keeps changing
> during early initialization.
>
> Even with one additional test case, it is not sufficient to prove that
> everything is done in the right way and that all the memory aliasing use
> cases are unblocked, as our direct customers may use the vdpa device in
> various VM setups together with whatever devices - GPU, VGA, VNC or
> other devices we don't know of yet that use aliased maps. It'd be crazy
> to exercise every possible device, and I don't see the real benefit of
> building yet another sub-system outside the QEMU memory system's realm
> just to support all these existing devices, given the relevant
> alias-supporting code is built in and working quite well within QEMU's
> own memory system. Don't get me wrong, it's not me who asked to add this
> code, but patch #2 is definitely not an optimization patch - we have to
> make sure the code is done in the right way no matter what. Functional
> correctness always speaks louder than performance or anything else.
>
> > So maybe we need to allocate
> > the translated addresses array in vhost_svq_vring_write_descs and
> > return it, or allow it to grow.
> >
> > Having said that, I'd keep the plan as building something simple that
> > works first, re-prioritize if we want to focus on reducing the
> > downtime or on increasing the SVQ throughput, and then act in
> > consequence by profiling the advantages of the changes. I'm really
> > looking forward to the switch to the (GPA|HVA)->IOVA tree, as I'm sure
> > it will increase the performance of the solution, but reviewing all of
> > it in one shot is beyond my bandwidth at the moment.
> No problem, and thank you for your time in reviewing. I think after I'm
> mostly done with my other assigned work I can work with Jonah to
> implement the desired code changes for the (GPA|HVA)->IOVA tree. It's
> not for optimization or performance - I just feel it's much simpler than
> taking on the extra burden of implementing and maintaining unwarranted,
> duplicated or preventative code in the SVQ or vhost-iova-tree layer just
> to get around various issues in memory aliasing or overlapping use
> cases. Actually, I already regret not giving Jonah the right guidance in
> the first place to implement all the necessary virtqueue layer
> supporting code rather than use the ram block API hack; believe me, it's
> not a lot, and the code change is really simple and easy to follow or
> review, with zero extra cost in GPA-HVA synchronization compared to what
> the current code already does. Otherwise, past lessons learned tell me
> it would cost 10x more if it's not built right in the first place -
> there'll be more future time and effort spent refactoring or rewriting
> all the code later, once all kinds of features are already built on top.
>
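Coming back to the GPA 0x500, len 0x1000 example quoted above, here is a standalone sketch (assumed types, not the SVQ or vhost-iova-tree code) of how that single guest descriptor translates into more than one IOVA range through the GPA->IOVA tree:

/*
 * Standalone illustration only: assumed structures, not the SVQ
 * implementation.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t gpa;
    uint64_t iova;
    uint64_t size;   /* length in bytes */
} GpaIovaMap;

int main(void)
{
    /* The tree from the example: GPA [0,0x1000) -> IOVA [0x3000,0x4000),
     * GPA [0x1000,0x2000) -> IOVA [0x2000,0x3000). */
    const GpaIovaMap tree[] = {
        { .gpa = 0x0,    .iova = 0x3000, .size = 0x1000 },
        { .gpa = 0x1000, .iova = 0x2000, .size = 0x1000 },
    };
    uint64_t gpa = 0x500, len = 0x1000;   /* single guest descriptor */

    /* Walk the buffer, emitting one IOVA chunk per map it crosses. */
    while (len > 0) {
        for (unsigned i = 0; i < 2; i++) {
            const GpaIovaMap *m = &tree[i];
            if (gpa >= m->gpa && gpa < m->gpa + m->size) {
                uint64_t chunk = m->gpa + m->size - gpa;
                if (chunk > len) {
                    chunk = len;
                }
                printf("IOVA 0x%" PRIx64 " len 0x%" PRIx64 "\n",
                       m->iova + (gpa - m->gpa), chunk);
                gpa += chunk;
                len -= chunk;
                break;
            }
        }
    }
    return 0;
}

This prints IOVA 0x3500 len 0xb00 followed by IOVA 0x2000 len 0x500, i.e. one iovec becomes two IOVA ranges - the 1:N case that vhost_svq_vring_write_descs would need extra room for, as discussed above.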
Got it. Looking forward to the next version then!