On Wed, 19 Feb 2025 11:58:44 -0700 Alex Williamson <alex.william...@redhat.com> wrote:
> On Wed, 19 Feb 2025 18:58:58 +0100
> Eric Auger <eric.au...@redhat.com> wrote:
>
> > Since kernel commit:
> > 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
> > in D3hot power state")
> > any attempt to do an mmap access to a BAR when the device is in d3hot
> > state will generate a fault.
> >
> > On system_powerdown, if the VFIO device is translated by an IOMMU,
> > the device is moved to D3hot state and then the vIOMMU gets disabled
> > by the guest. As a result of this later operation, the address space is
> > swapped from translated to untranslated. When re-enabling the aliased
> > regions, the RAM regions are dma-mapped again and this causes DMA_MAP
> > faults when attempting the operation on BARs.
> >
> > To avoid doing the remap on those BARs, we compute whether the
> > device is in D3hot state and if so, skip the DMA MAP.
>
> Thinking on this some more, QEMU PCI code already manages the device
> BARs appearing in the address space based on the memory enable bit in
> the command register. Should we do the same for PM state?
>
> IOW, the device going into low power state should remove the BARs from
> the AddressSpace and waking the device should re-add them. The BAR DMA
> mapping should then always be consistent, whereas here nothing would
> remap the BARs when the device is woken.
>
> I imagine we'd need an interface to register the PM capability with the
> core QEMU PCI code, where address space updates are performed relative
> to both memory enable and power status. There might be a way to
> implement this just for vfio-pci devices by toggling the enable state
> of the BAR mmaps relative to PM state, but doing it at the PCI core
> level seems like it'd provide behavior more true to physical hardware.

I took a stab at this approach here, it doesn't obviously break anything
in my configs, but I haven't yet tried to reproduce this exact scenario.

https://gitlab.com/alex.williamson/qemu/-/tree/pci-pm-power-state

There's another pm_cap on the PCIExpressDevice that needs to be
consolidated as well, once I do some research to figure out why a
non-express capability is tracked only by express devices and what
they're doing with it. Thanks,

Alex
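
For illustration only, the idea of gating BAR visibility on both the memory
enable bit and the power state can be sketched as a small standalone model.
This is not QEMU code and not the contents of the branch above: the
ModelPCIDevice type and the model_* helpers are hypothetical names, and
treating every non-D0 state as "BARs hidden" is an assumption of the sketch.
Only the register bit values (memory enable in the command register, the
PowerState field in PMCSR) come from the PCI specifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit definitions from the PCI spec (names as in Linux pci_regs.h). */
#define PCI_COMMAND_MEMORY     0x0002  /* command register: memory space enable */
#define PCI_PM_CTRL_STATE_MASK 0x0003  /* PMCSR: PowerState field, 0 = D0, 3 = D3hot */

/* Hypothetical name used only in this sketch. */
#define MODEL_PM_STATE_D0      0

/* Hypothetical, simplified stand-in for a PCI device model. */
typedef struct {
    uint16_t command;     /* shadow of the PCI command register */
    uint16_t pmcsr;       /* shadow of the PM control/status register */
    bool     has_pm_cap;  /* true once a PM capability has been registered */
    bool     bars_mapped; /* whether the BARs are visible in the address space */
} ModelPCIDevice;

/* BARs are visible only if memory decode is on AND the device is in D0. */
bool model_bars_should_be_mapped(const ModelPCIDevice *dev)
{
    assert(dev != NULL);

    if (!(dev->command & PCI_COMMAND_MEMORY)) {
        return false;
    }
    /* Assumption of this sketch: any non-D0 state hides the BARs. */
    if (dev->has_pm_cap &&
        (dev->pmcsr & PCI_PM_CTRL_STATE_MASK) != MODEL_PM_STATE_D0) {
        return false;
    }
    return true;
}

/* Single place where both inputs are re-evaluated after a config write. */
void model_update_mappings(ModelPCIDevice *dev)
{
    dev->bars_mapped = model_bars_should_be_mapped(dev);
}

void model_write_command(ModelPCIDevice *dev, uint16_t val)
{
    dev->command = val;
    model_update_mappings(dev);
}

void model_write_pmcsr(ModelPCIDevice *dev, uint16_t val)
{
    dev->pmcsr = val;
    model_update_mappings(dev);
}
```

In this model, a guest write moving the device to D3hot unmaps the BARs, and
the write returning it to D0 maps them again, so a later address space swap
never sees BARs belonging to a powered-down device.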