> On Feb 10, 2022, at 5:53 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
>
> On Thu, Feb 10, 2022 at 10:23:01PM +0000, Jag Raman wrote:
>>
>>
>>> On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>>>
>>> On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:
>>>>
>>>>
>>>>> On Feb 2, 2022, at 12:34 AM, Alex Williamson
>>>>> <alex.william...@redhat.com> wrote:
>>>>>
>>>>> On Wed, 2 Feb 2022 01:13:22 +0000
>>>>> Jag Raman <jag.ra...@oracle.com> wrote:
>>>>>
>>>>>>> On Feb 1, 2022, at 5:47 PM, Alex Williamson
>>>>>>> <alex.william...@redhat.com> wrote:
>>>>>>>
>>>>>>> On Tue, 1 Feb 2022 21:24:08 +0000
>>>>>>> Jag Raman <jag.ra...@oracle.com> wrote:
>>>>>>>
>>>>>>>>> On Feb 1, 2022, at 10:24 AM, Alex Williamson
>>>>>>>>> <alex.william...@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On Tue, 1 Feb 2022 09:30:35 +0000
>>>>>>>>> Stefan Hajnoczi <stefa...@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:
>>>>>>>>>>> On Fri, 28 Jan 2022 09:18:08 +0000
>>>>>>>>>>> Stefan Hajnoczi <stefa...@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If the goal here is to restrict DMA between devices, ie.
>>>>>>>>>>>>> peer-to-peer (p2p), why are we trying to re-invent what an
>>>>>>>>>>>>> IOMMU already does?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The issue Dave raised is that vfio-user servers run in separate
>>>>>>>>>>>> processes from QEMU with shared memory access to RAM but no
>>>>>>>>>>>> direct access to non-RAM MemoryRegions. The virtiofs DAX Window
>>>>>>>>>>>> BAR is one example of a non-RAM MemoryRegion that can be the
>>>>>>>>>>>> source/target of DMA requests.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think IOMMUs solve this problem but luckily the
>>>>>>>>>>>> vfio-user protocol already has messages that vfio-user servers
>>>>>>>>>>>> can use as a fallback when DMA cannot be completed through the
>>>>>>>>>>>> shared memory RAM accesses.
>>>>>>>>>>>>
>>>>>>>>>>>>> In fact, it seems like an IOMMU does this better in providing
>>>>>>>>>>>>> an IOVA address space per BDF. Is the dynamic mapping overhead
>>>>>>>>>>>>> too much? What physical hardware properties or specifications
>>>>>>>>>>>>> could we leverage to restrict p2p mappings to a device? Should
>>>>>>>>>>>>> it be governed by machine type to provide consistency between
>>>>>>>>>>>>> devices? Should each "isolated" bus be in a separate root
>>>>>>>>>>>>> complex? Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> There is a separate issue in this patch series regarding
>>>>>>>>>>>> isolating the address space where BAR accesses are made (i.e.
>>>>>>>>>>>> the global address_space_memory/io). When one process hosts
>>>>>>>>>>>> multiple vfio-user server instances (e.g. a software-defined
>>>>>>>>>>>> network switch with multiple ethernet devices) then each
>>>>>>>>>>>> instance needs isolated memory and io address spaces so that
>>>>>>>>>>>> vfio-user clients don't cause collisions when they map BARs to
>>>>>>>>>>>> the same address.
>>>>>>>>>>>>
>>>>>>>>>>>> I think the separate root complex idea is a good solution. This
>>>>>>>>>>>> patch series takes a different approach by adding the concept
>>>>>>>>>>>> of isolated address spaces into hw/pci/.
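As an aside, the per-instance isolation described above boils down to
giving each vfio-user server instance its own root MemoryRegion and
AddressSpace pair instead of the global system memory/io. A minimal
sketch in QEMU terms, with illustrative names only (this is not the
pci_isol_as_mem/io() code from the series):

#include "qemu/osdep.h"
#include "exec/memory.h"

/*
 * Sketch only: one isolated memory/io root and AddressSpace pair per
 * vfio-user server instance, so BAR mappings from different clients
 * cannot collide. Names are illustrative and do not reflect the actual
 * pci_isol_as_mem/io() implementation in the series.
 */
typedef struct VfuServerInstance {
    MemoryRegion isol_mem_root;
    MemoryRegion isol_io_root;
    AddressSpace isol_as_mem;
    AddressSpace isol_as_io;
} VfuServerInstance;

static void vfu_server_init_isolated_as(VfuServerInstance *s, Object *owner)
{
    /* Private roots instead of the global system memory/io regions. */
    memory_region_init(&s->isol_mem_root, owner, "vfu-isol-mem", UINT64_MAX);
    memory_region_init(&s->isol_io_root, owner, "vfu-isol-io", 65536);

    address_space_init(&s->isol_as_mem, &s->isol_mem_root, "vfu-isol-as-mem");
    address_space_init(&s->isol_as_io, &s->isol_io_root, "vfu-isol-as-io");
}

BARs mapped by different clients then land in different isol_as_mem
instances and can no longer collide with each other.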
>>>>>>>>>>>
>>>>>>>>>>> This all still seems pretty sketchy, BARs cannot overlap within
>>>>>>>>>>> the same vCPU address space, perhaps with the exception of when
>>>>>>>>>>> they're being sized, but DMA should be disabled during sizing.
>>>>>>>>>>>
>>>>>>>>>>> Devices within the same VM context with identical BARs would
>>>>>>>>>>> need to operate in different address spaces. For example, a
>>>>>>>>>>> translation offset in the vCPU address space would allow unique
>>>>>>>>>>> addressing to the devices, perhaps using the translation offset
>>>>>>>>>>> bits to address a root complex and masking those bits for
>>>>>>>>>>> downstream transactions.
>>>>>>>>>>>
>>>>>>>>>>> In general, the device simply operates in an address space, ie.
>>>>>>>>>>> an IOVA. When a mapping is made within that address space, we
>>>>>>>>>>> perform a translation as necessary to generate a guest physical
>>>>>>>>>>> address. The IOVA itself is only meaningful within the context
>>>>>>>>>>> of the address space, there is no requirement or expectation for
>>>>>>>>>>> it to be globally unique.
>>>>>>>>>>>
>>>>>>>>>>> If the vfio-user server is making some sort of requirement that
>>>>>>>>>>> IOVAs are unique across all devices, that seems very, very
>>>>>>>>>>> wrong. Thanks,
>>>>>>>>>>
>>>>>>>>>> Yes, BARs and IOVAs don't need to be unique across all devices.
>>>>>>>>>>
>>>>>>>>>> The issue is that there can be as many guest physical address
>>>>>>>>>> spaces as there are vfio-user clients connected, so per-client
>>>>>>>>>> isolated address spaces are required. This patch series has a
>>>>>>>>>> solution to that problem with the new pci_isol_as_mem/io() API.
>>>>>>>>>
>>>>>>>>> Sorry, this still doesn't follow for me. A server that hosts
>>>>>>>>> multiple devices across many VMs (I'm not sure if you're referring
>>>>>>>>> to the device or the VM as a client) needs to deal with different
>>>>>>>>> address spaces per device. The server needs to be able to uniquely
>>>>>>>>> identify every DMA, which must be part of the interface protocol.
>>>>>>>>> But I don't see how that imposes a requirement of an isolated
>>>>>>>>> address space. If we want the device isolated because we don't
>>>>>>>>> trust the server, that's where an IOMMU provides per device
>>>>>>>>> isolation. What is the restriction of the per-client isolated
>>>>>>>>> address space and why do we need it? The server needing to support
>>>>>>>>> multiple clients is not a sufficient answer to impose new PCI bus
>>>>>>>>> types with an implicit restriction on the VM.
>>>>>>>>
>>>>>>>> Hi Alex,
>>>>>>>>
>>>>>>>> I believe there are two separate problems with running PCI devices
>>>>>>>> in the vfio-user server. The first one concerns memory isolation
>>>>>>>> and the second one concerns vectoring of BAR accesses (as explained
>>>>>>>> below).
>>>>>>>>
>>>>>>>> In our previous patches (v3), we used an IOMMU to isolate memory
>>>>>>>> spaces. But we still had trouble with the vectoring. So we
>>>>>>>> implemented separate address spaces for each PCIBus to tackle both
>>>>>>>> problems simultaneously, based on the feedback we got.
>>>>>>>>
>>>>>>>> The following gives an overview of issues concerning vectoring of
>>>>>>>> BAR accesses.
>>>>>>>>
>>>>>>>> The device’s BAR regions are mapped into the guest physical address
>>>>>>>> space. The guest writes the guest PA of each BAR into the device’s
>>>>>>>> BAR registers.
>>>>>>>> To access the BAR regions of the device, QEMU uses
>>>>>>>> address_space_rw() which vectors the physical address access to
>>>>>>>> the device BAR region handlers.
>>>>>>>
>>>>>>> The guest physical address written to the BAR is irrelevant from
>>>>>>> the device perspective, this only serves to assign the BAR an
>>>>>>> offset within the address_space_mem, which is used by the vCPU (and
>>>>>>> possibly other devices depending on their address space). There is
>>>>>>> no reason for the device itself to care about this address.
>>>>>>
>>>>>> Thank you for the explanation, Alex!
>>>>>>
>>>>>> The confusion on my part is whether we are inside the device already
>>>>>> when the server receives a request to access the BAR region of a
>>>>>> device. Based on your explanation, I get that your view is the BAR
>>>>>> access request has propagated into the device already, whereas I was
>>>>>> under the impression that the request is still on the CPU side of
>>>>>> the PCI root complex.
>>>>>
>>>>> If you are getting an access through your MemoryRegionOps, all the
>>>>> translations have been made, you simply need to use the hwaddr as the
>>>>> offset into the MemoryRegion for the access. Perform the read/write
>>>>> to your device, no further translations required.
>>>>>
>>>>>> Your view makes sense to me - once the BAR access request reaches
>>>>>> the client (on the other side), we could consider that the request
>>>>>> has reached the device.
>>>>>>
>>>>>> On a separate note, if devices don’t care about the values in BAR
>>>>>> registers, why do the default PCI config handlers intercept and map
>>>>>> the BAR region into address_space_mem?
>>>>>> (pci_default_write_config() -> pci_update_mappings())
>>>>>
>>>>> This is the part that's actually placing the BAR MemoryRegion as a
>>>>> sub-region into the vCPU address space. I think if you track it,
>>>>> you'll see PCIDevice.io_regions[i].address_space is actually
>>>>> system_memory, which is used to initialize address_space_memory.
>>>>>
>>>>> The machine assembles PCI devices onto buses as instructed by the
>>>>> command line or hot plug operations. It's the responsibility of the
>>>>> guest firmware and guest OS to probe those devices, size the BARs,
>>>>> and place the BARs into the memory hierarchy of the PCI bus, ie.
>>>>> system memory. The BARs are necessarily in the "guest physical
>>>>> memory" for vCPU access, but it's essentially only coincidental that
>>>>> PCI devices might be in an address space that provides a mapping to
>>>>> their own BAR. There's no reason to ever use it.
>>>>>
>>>>> In the vIOMMU case, we can't know that the device address space
>>>>> includes those BAR mappings or if they do, that they're identity
>>>>> mapped to the physical address. Devices really need to not infer
>>>>> anything about an address. Think about real hardware: a device is
>>>>> told by driver programming to perform a DMA operation. The device
>>>>> doesn't know the target of that operation, it's the guest driver's
>>>>> responsibility to make sure the IOVA within the device address space
>>>>> is valid and maps to the desired target. Thanks,
>>>>
>>>> Thanks for the explanation, Alex. Thanks to everyone else in the
>>>> thread who helped to clarify this problem.
>>>>
>>>> We have implemented the memory isolation based on the discussion in
>>>> the thread. We will send the patches out shortly.
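To put Alex's point in code terms: a BAR's MemoryRegionOps callbacks only
ever see an offset into that BAR; the guest physical address the BAR was
programmed with never appears here. A minimal sketch with a made-up
device and register file:

#include "qemu/osdep.h"
#include "exec/memory.h"

/* Illustrative device state; not a real QEMU device. */
typedef struct MyDevState {
    uint32_t regs[64];
} MyDevState;

/*
 * Sketch of a BAR's MemoryRegionOps: by the time these callbacks run,
 * every bus/address-space translation has already happened and 'addr'
 * is simply the offset into this BAR.
 */
static uint64_t mydev_bar_read(void *opaque, hwaddr addr, unsigned size)
{
    MyDevState *s = opaque;

    return addr / 4 < ARRAY_SIZE(s->regs) ? s->regs[addr / 4] : 0;
}

static void mydev_bar_write(void *opaque, hwaddr addr, uint64_t val,
                            unsigned size)
{
    MyDevState *s = opaque;

    if (addr / 4 < ARRAY_SIZE(s->regs)) {
        s->regs[addr / 4] = val;
    }
}

static const MemoryRegionOps mydev_bar_ops = {
    .read = mydev_bar_read,
    .write = mydev_bar_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};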
>>>>
>>>> Devices such as “name” and “e1000” worked fine. But I’d like to note
>>>> that the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t
>>>> seem to be IOMMU aware. In LSI’s case, the kernel driver is asking the
>>>> device to read instructions from the CPU VA (lsi_execute_script() ->
>>>> read_dword()), which is forbidden when the IOMMU is enabled.
>>>> Specifically, the driver is asking the device to access other BAR
>>>> regions by using the BAR address programmed in the PCI config space.
>>>> This happens even without the vfio-user patches. For example, we could
>>>> enable the IOMMU using the “-device intel-iommu” QEMU option and also
>>>> add the following to the kernel command line: “intel_iommu=on
>>>> iommu=nopt”. In this case, we could see an IOMMU fault.
>>>
>>> So, a device accessing its own BAR is different. Basically, these
>>> transactions never go on the bus at all, never mind get to the IOMMU.
>>
>> Hi Michael,
>>
>> In the LSI case, I did notice that it went to the IOMMU.
>
> Hmm do you mean you analyzed how a physical device works?
> Or do you mean in QEMU?
I mean in QEMU, I did not analyze a physical device.

>
>> The device is reading the BAR address as if it was a DMA address.
>
> I got that, my understanding of PCI was that a device can not be both a
> master and a target of a transaction at the same time though. Could not
> find this in the spec though, maybe I remember incorrectly.

I see, OK. If this were to happen in a real device, PCI would raise an
error because the master and target of a transaction can’t be the same.

So you believe that this access is handled inside the device, and
doesn’t go out.

Thanks!
--
Jag

>
>>> I think it's just used as a handle to address internal device memory.
>>> This kind of trick is not universal, but not terribly unusual.
>>>
>>>
>>>> Unfortunately, we started off our project with the LSI device. So
>>>> that led to all the confusion about what is expected at the server
>>>> end in terms of vectoring/address-translation. It gave an impression
>>>> as if the request was still on the CPU side of the PCI root complex,
>>>> but the actual problem was with the device driver itself.
>>>>
>>>> I’m wondering how to deal with this problem. Would it be OK if we
>>>> mapped the device’s BAR into the IOVA, at the same CPU VA programmed
>>>> in the BAR registers? This would help devices such as LSI to
>>>> circumvent this problem. One problem with this approach is that it
>>>> has the potential to collide with another legitimate IOVA address.
>>>> Kindly share your thoughts on this.
>>>>
>>>> Thank you!
>>>
>>> I am not 100% sure what you plan to do but it sounds fine since even
>>> if it collides, with traditional PCI devices must never initiate cycles
>>
>> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.
>>
>> Thank you!
>> --
>> Jag
>>
>>> within their own BAR range, and PCIe is software-compatible with PCI.
>>> So devices won't be able to access this IOVA even if it was programmed
>>> in the IOMMU.
>>>
>>> As was mentioned elsewhere on this thread, devices accessing each
>>> other's BAR is a different matter.
>>>
>>> I do not remember which rules apply to multiple functions of a
>>> multi-function device though. I think in a traditional PCI
>>> they will never go out on the bus, but with e.g. SRIOV they
>>> would probably go out? Alex, any idea?
>>>
>>>
>>>> --
>>>> Jag
>>>>
>>>>>
>>>>> Alex
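For reference, a rough sketch of what mapping a device's own BARs into
its IOVA space could look like on the QEMU side: an alias of each BAR
MemoryRegion is added to the device's DMA address space at the
guest-programmed BAR address. The PCI core fields and memory API calls
are real, but the helper itself is hypothetical and omits collision
handling:

#include "qemu/osdep.h"
#include "hw/pci/pci.h"
#include "exec/memory.h"

/*
 * Hypothetical helper: make a device's own memory BARs reachable through
 * its DMA (IOVA) address space at the addresses the guest programmed into
 * the BAR registers, so devices whose drivers point them at their own BAR
 * keep working behind an IOMMU. Sketch only.
 */
static void map_own_bars_into_iova(PCIDevice *pdev, MemoryRegion *dma_root)
{
    for (int i = 0; i < PCI_NUM_REGIONS; i++) {
        PCIIORegion *r = &pdev->io_regions[i];
        MemoryRegion *alias;

        if (!r->memory || r->addr == PCI_BAR_UNMAPPED) {
            continue;   /* BAR not present or not mapped by the guest yet */
        }
        if (r->type & PCI_BASE_ADDRESS_SPACE_IO) {
            continue;   /* only memory BARs are aliased in this sketch */
        }

        alias = g_new0(MemoryRegion, 1);
        memory_region_init_alias(alias, OBJECT(pdev), "bar-iova-alias",
                                 r->memory, 0, r->size);
        /* Place the alias at the guest-programmed BAR address (the IOVA). */
        memory_region_add_subregion(dma_root, r->addr, alias);
    }
}

The collision concern raised above would show up here as the alias
overlapping an existing IOVA mapping in dma_root, so a real
implementation would need to detect and handle that case.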