On Tue, Feb 01, 2022 at 10:34:32PM -0700, Alex Williamson wrote:
> On Wed, 2 Feb 2022 01:13:22 +0000
> Jag Raman <jag.ra...@oracle.com> wrote:
> 
> > > On Feb 1, 2022, at 5:47 PM, Alex Williamson <alex.william...@redhat.com> 
> > > wrote:
> > > 
> > > On Tue, 1 Feb 2022 21:24:08 +0000
> > > Jag Raman <jag.ra...@oracle.com> wrote:
> > >   
> > >>> On Feb 1, 2022, at 10:24 AM, Alex Williamson 
> > >>> <alex.william...@redhat.com> wrote:
> > >>> 
> > >>> On Tue, 1 Feb 2022 09:30:35 +0000
> > >>> Stefan Hajnoczi <stefa...@redhat.com> wrote:
> > >>>   
> > >>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:    
> > >>>>> On Fri, 28 Jan 2022 09:18:08 +0000
> > >>>>> Stefan Hajnoczi <stefa...@redhat.com> wrote:
> > >>>>>   
> > >>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:     
> > >>>>>>  
> > >>>>>>> If the goal here is to restrict DMA between devices, ie. 
> > >>>>>>> peer-to-peer
> > >>>>>>> (p2p), why are we trying to re-invent what an IOMMU already does?   
> > >>>>>>>      
> > >>>>>> 
> > >>>>>> The issue Dave raised is that vfio-user servers run in separate
> > >>>>>> processes from QEMU with shared memory access to RAM but no direct
> > >>>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> > >>>>>> example of a non-RAM MemoryRegion that can be the source/target of 
> > >>>>>> DMA
> > >>>>>> requests.
> > >>>>>> 
> > >>>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
> > >>>>>> protocol already has messages that vfio-user servers can use as a
> > >>>>>> fallback when DMA cannot be completed through the shared memory RAM
> > >>>>>> accesses.
> > >>>>>>   
> > >>>>>>> In
> > >>>>>>> fact, it seems like an IOMMU does this better in providing an IOVA
> > >>>>>>> address space per BDF.  Is the dynamic mapping overhead too much?  
> > >>>>>>> What
> > >>>>>>> physical hardware properties or specifications could we leverage to
> > >>>>>>> restrict p2p mappings to a device?  Should it be governed by machine
> > >>>>>>> type to provide consistency between devices?  Should each "isolated"
> > >>>>>>> bus be in a separate root complex?  Thanks,        
> > >>>>>> 
> > >>>>>> There is a separate issue in this patch series regarding isolating 
> > >>>>>> the
> > >>>>>> address space where BAR accesses are made (i.e. the global
> > >>>>>> address_space_memory/io). When one process hosts multiple vfio-user
> > >>>>>> server instances (e.g. a software-defined network switch with 
> > >>>>>> multiple
> > >>>>>> ethernet devices) then each instance needs isolated memory and io 
> > >>>>>> address
> > >>>>>> spaces so that vfio-user clients don't cause collisions when they map
> > >>>>>> BARs to the same address.
> > >>>>>> 
> > >>>>>> I think the separate root complex idea is a good solution. This
> > >>>>>> patch series takes a different approach by adding the concept of
> > >>>>>> isolated address spaces into hw/pci/.      
> > >>>>> 
> > >>>>> This all still seems pretty sketchy; BARs cannot overlap within the
> > >>>>> same vCPU address space, perhaps with the exception of when they're
> > >>>>> being sized, but DMA should be disabled during sizing.
> > >>>>> 
> > >>>>> Devices within the same VM context with identical BARs would need to
> > >>>>> operate in different address spaces.  For example a translation offset
> > >>>>> in the vCPU address space would allow unique addressing to the 
> > >>>>> devices,
> > >>>>> perhaps using the translation offset bits to address a root complex 
> > >>>>> and
> > >>>>> masking those bits for downstream transactions.
> > >>>>> 
> > >>>>> In general, the device simply operates in an address space, ie. an
> > >>>>> IOVA.  When a mapping is made within that address space, we perform a
> > >>>>> translation as necessary to generate a guest physical address.  The
> > >>>>> IOVA itself is only meaningful within the context of the address 
> > >>>>> space,
> > >>>>> there is no requirement or expectation for it to be globally unique.
> > >>>>> 
> > >>>>> If the vfio-user server is making some sort of requirement that IOVAs
> > >>>>> are unique across all devices, that seems very, very wrong.  Thanks,  
> > >>>>>     
> > >>>> 
> > >>>> Yes, BARs and IOVAs don't need to be unique across all devices.
> > >>>> 
> > >>>> The issue is that there can be as many guest physical address spaces as
> > >>>> there are vfio-user clients connected, so per-client isolated address
> > >>>> spaces are required. This patch series has a solution to that problem
> > >>>> with the new pci_isol_as_mem/io() API.    
> > >>> 
> > >>> Sorry, this still doesn't follow for me.  A server that hosts multiple
> > >>> devices across many VMs (I'm not sure if you're referring to the device
> > >>> or the VM as a client) needs to deal with different address spaces per
> > >>> device.  The server needs to be able to uniquely identify every DMA,
> > >>> which must be part of the interface protocol.  But I don't see how that
> > >>> imposes a requirement of an isolated address space.  If we want the
> > >>> device isolated because we don't trust the server, that's where an IOMMU
> > >>> provides per device isolation.  What is the restriction of the
> > >>> per-client isolated address space and why do we need it?  The server
> > >>> needing to support multiple clients is not a sufficient answer to
> > >>> impose new PCI bus types with an implicit restriction on the VM.    
> > >> 
> > >> Hi Alex,
> > >> 
> > >> I believe there are two separate problems with running PCI devices in
> > >> the vfio-user server. The first one concerns memory isolation and the
> > >> second the vectoring of BAR accesses (as explained below).
> > >> 
> > >> In our previous patches (v3), we used an IOMMU to isolate memory
> > >> spaces. But we still had trouble with the vectoring. So we implemented
> > >> separate address spaces for each PCIBus to tackle both problems
> > >> simultaneously, based on the feedback we got.
> > >> 
> > >> The following gives an overview of issues concerning vectoring of
> > >> BAR accesses.
> > >> 
> > >> The device’s BAR regions are mapped into the guest physical address
> > >> space. The guest writes the guest PA of each BAR into the device’s BAR
> > >> registers. To access the BAR regions of the device, QEMU uses
> > >> address_space_rw() which vectors the physical address access to the
> > >> device BAR region handlers.  
> > > 
> > > The guest physical address written to the BAR is irrelevant from the
> > > device perspective, this only serves to assign the BAR an offset within
> > > the address_space_mem, which is used by the vCPU (and possibly other
> > > devices depending on their address space).  There is no reason for the
> > > device itself to care about this address.  
> > 
> > Thank you for the explanation, Alex!
> > 
> > The confusion on my part is whether we are inside the device already when
> > the server receives a request to access BAR region of a device. Based on
> > your explanation, I get that your view is the BAR access request has
> > propagated into the device already, whereas I was under the impression
> > that the request is still on the CPU side of the PCI root complex.
> 
> If you are getting an access through your MemoryRegionOps, all the
> translations have been made, you simply need to use the hwaddr as the
> offset into the MemoryRegion for the access.  Perform the read/write to
> your device, no further translations required.

The access comes via libvfio-user's vfu_region_access_cb_t callback, not
via MemoryRegionOps. The callback is currently implemented by calling
address_space_rw() on the pci_isol_as_mem/io() address space, depending
on the BAR type. The code is in "[PATCH v5 15/18] vfio-user: handle PCI BAR
accesses".

It's possible to reimplement the patch to directly call
memory_region_dispatch_read/write() on r->memory instead of
address_space_rw() as you've described.
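
For illustration, here is a minimal sketch of what that dispatch could look
like (the helper name and plumbing are made up, not the code from the series;
it assumes the buffer handed over by the vfu_region_access_cb_t callback
holds little-endian data, as on the PCI bus):

#include "qemu/osdep.h"
#include "qemu/bswap.h"
#include "exec/memory.h"
#include "exec/memop.h"

/* Dispatch one BAR access directly at the BAR's MemoryRegion (r->memory)
 * instead of going through an AddressSpace. */
static MemTxResult vfu_bar_rw(MemoryRegion *bar_mr, hwaddr offset,
                              void *buf, unsigned size, bool is_write)
{
    MemOp op = size_memop(size) | MO_LE;
    uint64_t val;
    MemTxResult ret;

    if (is_write) {
        val = ldn_le_p(buf, size);
        ret = memory_region_dispatch_write(bar_mr, offset, val, op,
                                           MEMTXATTRS_UNSPECIFIED);
    } else {
        ret = memory_region_dispatch_read(bar_mr, offset, &val, op,
                                          MEMTXATTRS_UNSPECIFIED);
        stn_le_p(buf, size, val);
    }
    return ret;
}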

> > Your view makes sense to me - once the BAR access request reaches the
> > client (on the other side), we could consider that the request has reached
> > the device.
> > 
> > On a separate note, if devices don’t care about the values in BAR
> > registers, why do the default PCI config handlers intercept and map
> > the BAR region into address_space_mem?
> > (pci_default_write_config() -> pci_update_mappings())
> 
> This is the part that's actually placing the BAR MemoryRegion as a
> sub-region into the vCPU address space.  I think if you track it,
> you'll see PCIDevice.io_regions[i].address_space is actually
> system_memory, which is used to initialize address_space_system.
> 
> The machine assembles PCI devices onto buses as instructed by the
> command line or hot plug operations.  It's the responsibility of the
> guest firmware and guest OS to probe those devices, size the BARs, and
> place the BARs into the memory hierarchy of the PCI bus, ie. system
> memory.  The BARs are necessarily in the "guest physical memory" for
> vCPU access, but it's essentially only coincidental that PCI devices
> might be in an address space that provides a mapping to their own BAR.
> There's no reason to ever use it.
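
(For reference, the placement described above essentially comes down to this
call in pci_update_mappings(), slightly simplified:)

/* Put the BAR MemoryRegion into the address space the bus was created
 * with (system memory for memory BARs), at the guest-programmed BAR
 * address. */
memory_region_add_subregion_overlap(r->address_space, r->addr,
                                    r->memory, 1);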

Good, I think nothing uses address_space_memory/io when BAR dispatch is
implemented with memory_region_dispatch_read/write() as you suggested.

It would be nice if there were a way to poison address_space_memory/io so
that any dispatch aborts - nothing should use them.
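
Just to sketch the idea (purely illustrative, not a proposal for the series;
the region name, priority value, and helper are made up): cover the root
container with a lowest-priority I/O region whose handlers abort.

static uint64_t poison_read(void *opaque, hwaddr addr, unsigned size)
{
    /* Nothing should dispatch here once BARs bypass the global spaces. */
    fprintf(stderr, "poisoned address space read @0x%" HWADDR_PRIx "\n", addr);
    abort();
}

static void poison_write(void *opaque, hwaddr addr, uint64_t val,
                         unsigned size)
{
    fprintf(stderr, "poisoned address space write @0x%" HWADDR_PRIx "\n", addr);
    abort();
}

static const MemoryRegionOps poison_ops = {
    .read = poison_read,
    .write = poison_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
};

/* Catch any dispatch that falls through to the root (e.g. system_memory)
 * by mapping a trap region underneath everything else. */
static void poison_address_space_root(MemoryRegion *root, MemoryRegion *trap)
{
    memory_region_init_io(trap, NULL, &poison_ops, NULL, "poison", UINT64_MAX);
    memory_region_add_subregion_overlap(root, 0, trap, -1000);
}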

We now have the option of dropping pci_isol_as_mem/io() again and using
->iommu_fn() to return an isolated memory address space containing the
vfio-user client's VFIO_USER_DMA_MAP regions.
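
Roughly along these lines (a sketch with made-up names, assuming the
PCIIOMMUFunc-style pci_setup_iommu() hook and that the container gets
populated from VFIO_USER_DMA_MAP messages elsewhere):

#include "qemu/osdep.h"
#include "exec/memory.h"
#include "hw/pci/pci.h"

typedef struct VfuServer {
    MemoryRegion dma_mr;   /* container filled in from VFIO_USER_DMA_MAP */
    AddressSpace dma_as;   /* per-client DMA address space */
} VfuServer;

static AddressSpace *vfu_dma_iommu_fn(PCIBus *bus, void *opaque, int devfn)
{
    VfuServer *server = opaque;

    /* Every device behind this bus sees only this client's DMA regions. */
    return &server->dma_as;
}

static void vfu_server_init_dma(VfuServer *server, PCIBus *bus)
{
    memory_region_init(&server->dma_mr, NULL, "vfu-dma", UINT64_MAX);
    address_space_init(&server->dma_as, &server->dma_mr, "vfu-dma-as");
    pci_setup_iommu(bus, vfu_dma_iommu_fn, server);
}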

Stefan
