On Tue, Feb 01, 2022 at 10:34:32PM -0700, Alex Williamson wrote:
> On Wed, 2 Feb 2022 01:13:22 +0000 Jag Raman <jag.ra...@oracle.com> wrote:
>
> > > On Feb 1, 2022, at 5:47 PM, Alex Williamson <alex.william...@redhat.com> wrote:
> > >
> > > On Tue, 1 Feb 2022 21:24:08 +0000 Jag Raman <jag.ra...@oracle.com> wrote:
> > >
> > >>> On Feb 1, 2022, at 10:24 AM, Alex Williamson <alex.william...@redhat.com> wrote:
> > >>>
> > >>> On Tue, 1 Feb 2022 09:30:35 +0000 Stefan Hajnoczi <stefa...@redhat.com> wrote:
> > >>>
> > >>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:
> > >>>>> On Fri, 28 Jan 2022 09:18:08 +0000 Stefan Hajnoczi <stefa...@redhat.com> wrote:
> > >>>>>
> > >>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:
> > >>>>>>
> > >>>>>>> If the goal here is to restrict DMA between devices, i.e. peer-to-peer (p2p), why are we trying to re-invent what an IOMMU already does?
> > >>>>>>
> > >>>>>> The issue Dave raised is that vfio-user servers run in separate processes from QEMU with shared memory access to RAM but no direct access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one example of a non-RAM MemoryRegion that can be the source/target of DMA requests.
> > >>>>>>
> > >>>>>> I don't think IOMMUs solve this problem, but luckily the vfio-user protocol already has messages that vfio-user servers can use as a fallback when DMA cannot be completed through the shared memory RAM accesses.
> > >>>>>>
> > >>>>>>> In fact, it seems like an IOMMU does this better in providing an IOVA address space per BDF. Is the dynamic mapping overhead too much? What physical hardware properties or specifications could we leverage to restrict p2p mappings to a device? Should it be governed by machine type to provide consistency between devices? Should each "isolated" bus be in a separate root complex? Thanks,
> > >>>>>>
> > >>>>>> There is a separate issue in this patch series regarding isolating the address space where BAR accesses are made (i.e. the global address_space_memory/io). When one process hosts multiple vfio-user server instances (e.g. a software-defined network switch with multiple ethernet devices), then each instance needs isolated memory and io address spaces so that vfio-user clients don't cause collisions when they map BARs to the same address.
> > >>>>>>
> > >>>>>> I think the separate root complex idea is a good solution. This patch series takes a different approach by adding the concept of isolated address spaces into hw/pci/.
> > >>>>>
> > >>>>> This all still seems pretty sketchy. BARs cannot overlap within the same vCPU address space, perhaps with the exception of when they're being sized, but DMA should be disabled during sizing.
> > >>>>>
> > >>>>> Devices within the same VM context with identical BARs would need to operate in different address spaces. For example, a translation offset in the vCPU address space would allow unique addressing to the devices, perhaps using the translation offset bits to address a root complex and masking those bits for downstream transactions.
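
(Aside: to make sure I follow the translation offset idea, I read it as something like the sketch below. The bit positions are made up purely for illustration - the upper bits of the vCPU-visible address select a root complex and are masked off before the transaction goes downstream:)

    #include <stdint.h>

    /* Illustration only, constants are made up: bits above RC_SHIFT select
     * a root complex and are masked off for downstream transactions, so
     * devices behind different root complexes can be programmed with
     * identical BAR values.
     */
    #define RC_SHIFT 48

    static inline unsigned addr_to_rc(uint64_t cpu_addr)
    {
        return cpu_addr >> RC_SHIFT;                     /* which root complex */
    }

    static inline uint64_t addr_for_downstream(uint64_t cpu_addr)
    {
        return cpu_addr & ((UINT64_C(1) << RC_SHIFT) - 1); /* RC bits masked */
    }
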
> > >>>>>
> > >>>>> In general, the device simply operates in an address space, i.e. an IOVA. When a mapping is made within that address space, we perform a translation as necessary to generate a guest physical address. The IOVA itself is only meaningful within the context of the address space; there is no requirement or expectation for it to be globally unique.
> > >>>>>
> > >>>>> If the vfio-user server is making some sort of requirement that IOVAs are unique across all devices, that seems very, very wrong. Thanks,
> > >>>>
> > >>>> Yes, BARs and IOVAs don't need to be unique across all devices.
> > >>>>
> > >>>> The issue is that there can be as many guest physical address spaces as there are vfio-user clients connected, so per-client isolated address spaces are required. This patch series has a solution to that problem with the new pci_isol_as_mem/io() API.
> > >>>
> > >>> Sorry, this still doesn't follow for me. A server that hosts multiple devices across many VMs (I'm not sure if you're referring to the device or the VM as a client) needs to deal with different address spaces per device. The server needs to be able to uniquely identify every DMA, which must be part of the interface protocol. But I don't see how that imposes a requirement of an isolated address space. If we want the device isolated because we don't trust the server, that's where an IOMMU provides per-device isolation. What is the restriction of the per-client isolated address space and why do we need it? The server needing to support multiple clients is not a sufficient answer to impose new PCI bus types with an implicit restriction on the VM.
> > >>
> > >> Hi Alex,
> > >>
> > >> I believe there are two separate problems with running PCI devices in the vfio-user server. The first one concerns memory isolation and the second one is the vectoring of BAR accesses (as explained below).
> > >>
> > >> In our previous patches (v3), we used an IOMMU to isolate memory spaces. But we still had trouble with the vectoring. So we implemented separate address spaces for each PCIBus to tackle both problems simultaneously, based on the feedback we got.
> > >>
> > >> The following gives an overview of the issues concerning vectoring of BAR accesses.
> > >>
> > >> The device’s BAR regions are mapped into the guest physical address space. The guest writes the guest PA of each BAR into the device’s BAR registers. To access the BAR regions of the device, QEMU uses address_space_rw(), which vectors the physical address access to the device BAR region handlers.
> > >
> > > The guest physical address written to the BAR is irrelevant from the device perspective; this only serves to assign the BAR an offset within the address_space_mem, which is used by the vCPU (and possibly other devices depending on their address space). There is no reason for the device itself to care about this address.
> >
> > Thank you for the explanation, Alex!
> >
> > The confusion on my part is whether we are already inside the device when the server receives a request to access a BAR region of a device. Based on your explanation, I get that your view is that the BAR access request has already propagated into the device, whereas I was under the impression that the request is still on the CPU side of the PCI root complex.
>
> If you are getting an access through your MemoryRegionOps, all the translations have been made; you simply need to use the hwaddr as the offset into the MemoryRegion for the access. Perform the read/write to your device, no further translations required.

The access comes via libvfio-user's vfu_region_access_cb_t callback, not via MemoryRegionOps. The callback is currently implemented by calling address_space_rw() on the pci_isol_as_mem/io() address space, depending on the BAR type. That code is in "[PATCH v5 15/18] vfio-user: handle PCI BAR accesses". It's possible to reimplement the patch to directly call memory_region_dispatch_read/write() on r->memory instead of address_space_rw(), as you've described.

> > Your view makes sense to me - once the BAR access request reaches the client (on the other side), we could consider that the request has reached the device.
> >
> > On a separate note, if devices don’t care about the values in BAR registers, why do the default PCI config handlers intercept and map the BAR region into address_space_mem? (pci_default_write_config() -> pci_update_mappings())
>
> This is the part that's actually placing the BAR MemoryRegion as a sub-region into the vCPU address space. I think if you track it, you'll see PCIDevice.io_regions[i].address_space is actually system_memory, which is used to initialize address_space_memory.
>
> The machine assembles PCI devices onto buses as instructed by the command line or hot plug operations. It's the responsibility of the guest firmware and guest OS to probe those devices, size the BARs, and place the BARs into the memory hierarchy of the PCI bus, i.e. system memory. The BARs are necessarily in the "guest physical memory" for vCPU access, but it's essentially only coincidental that PCI devices might be in an address space that provides a mapping to their own BAR. There's no reason to ever use it.

Good, I think nothing uses address_space_memory/io when BAR dispatch is implemented with memory_region_dispatch_read/write() as you suggested. It would be nice if there was a way to poison address_space_memory/io to abort on dispatch - nothing should use them.

We now have the option of dropping pci_isol_as_mem/io() again and using ->iommu_fn() to return an isolated memory address space containing the vfio-user client's VFIO_USER_DMA_MAP regions.

Stefan
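
P.S. To make the memory_region_dispatch_read/write() idea concrete, here is a rough, untested sketch of a helper that the per-BAR vfu_region_access_cb_t callbacks registered with vfu_setup_region() could call. This is not the code from the patch: the VfuObject struct, the o->pci_dev field, and the helper name are placeholders, error handling is minimal, and only naturally aligned power-of-two sizes are handled.

    #include "qemu/osdep.h"
    #include "exec/memory.h"
    #include "exec/memop.h"
    #include "hw/pci/pci.h"

    typedef struct {
        PCIDevice *pci_dev;
        /* ... */
    } VfuObject;                 /* placeholder for the server's per-device state */

    /*
     * Dispatch a BAR access directly on the BAR's MemoryRegion, so no
     * AddressSpace (and no pci_isol_as_mem/io()) is involved. The offset
     * supplied by the vfio-user client is already BAR-relative.
     */
    static ssize_t vfu_bar_rw(VfuObject *o, int bar, char *buf, size_t count,
                              loff_t offset, bool is_write)
    {
        MemoryRegion *mr = o->pci_dev->io_regions[bar].memory;
        MemTxResult res;
        uint64_t val;

        if (count != 1 && count != 2 && count != 4 && count != 8) {
            return -EINVAL; /* real code would split larger accesses */
        }

        if (is_write) {
            val = ldn_he_p(buf, count);
            res = memory_region_dispatch_write(mr, offset, val,
                                               size_memop(count),
                                               MEMTXATTRS_UNSPECIFIED);
        } else {
            res = memory_region_dispatch_read(mr, offset, &val,
                                              size_memop(count),
                                              MEMTXATTRS_UNSPECIFIED);
            stn_he_p(buf, count, val);
        }

        return res == MEMTX_OK ? (ssize_t)count : -EIO;
    }

The point is that the hwaddr passed to the dispatch functions is just the BAR-relative offset, exactly as you described for MemoryRegionOps, and the global memory/io address spaces never get involved.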