On Wed, 19 Feb 2025 11:58:44 -0700
Alex Williamson <alex.william...@redhat.com> wrote:

> On Wed, 19 Feb 2025 18:58:58 +0100
> Eric Auger <eric.au...@redhat.com> wrote:
> 
> > Since kernel commit:
> > 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
> > in D3hot power state")
> > any attempt to do an mmap access to a BAR when the device is in D3hot
> > state will generate a fault.
> > 
> > On system_powerdown, if the VFIO device is translated by an IOMMU,
> > the device is moved to D3hot state and then the vIOMMU gets disabled
> > by the guest. As a result of this latter operation, the address space is
> > swapped from translated to untranslated. When re-enabling the aliased
> > regions, the RAM regions are dma-mapped again and this causes DMA_MAP
> > faults when attempting the operation on BARs.
> > 
> > To avoid doing the remap on those BARs, we compute whether the
> > device is in D3hot state and, if so, skip the DMA MAP.
> 
> Thinking on this some more, QEMU PCI code already manages the device
> BARs appearing in the address space based on the memory enable bit in
> the command register.  Should we do the same for PM state?
> 
> IOW, the device going into low power state should remove the BARs from
> the AddressSpace and waking the device should re-add them.  The BAR DMA
> mapping should then always be consistent, whereas here nothing would
> remap the BARs when the device is woken.
> 
> I imagine we'd need an interface to register the PM capability with the
> core QEMU PCI code, where address space updates are performed relative
> to both memory enable and power status.  There might be a way to
> implement this just for vfio-pci devices by toggling the enable state
> of the BAR mmaps relative to PM state, but doing it at the PCI core
> level seems like it'd provide behavior more true to physical hardware.
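As a rough illustration of the check discussed above, the C sketch below decodes the PowerState field of the PM Control/Status register (PMCSR) from an emulated config-space buffer and combines it with the Memory Space Enable bit. The register offsets and bit masks follow the PCI and PCI PM specifications (Linux exports the same values in <linux/pci_regs.h>). The helper names, the flat 256-byte config buffer, and the policy that memory BARs are visible only in D0 are assumptions for illustration. This is not code from QEMU or from the patch under discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standard PCI config-space offsets and bits (same values as linux/pci_regs.h). */
#define PCI_CONFIG_SPACE_SIZE  256
#define PCI_COMMAND            0x04    /* Command register offset */
#define PCI_COMMAND_MEMORY     0x2     /* Memory Space Enable bit */
#define PCI_PM_CTRL            4       /* PMCSR offset from the PM capability */
#define PCI_PM_CTRL_STATE_MASK 0x0003  /* PowerState field, bits 1:0 */

enum pci_power_state { PCI_D0 = 0, PCI_D1 = 1, PCI_D2 = 2, PCI_D3HOT = 3 };

/* Read a little-endian 16-bit register from an emulated config space buffer. */
static uint16_t cfg_read16(const uint8_t *config, unsigned off)
{
    assert(off + 1 < PCI_CONFIG_SPACE_SIZE);
    return (uint16_t)(config[off] | (config[off + 1] << 8));
}

/* Current power state from PMCSR; without a PM capability (pm_cap == 0)
 * the device is treated as always being in D0. */
static enum pci_power_state pm_state(const uint8_t *config, uint8_t pm_cap)
{
    if (!pm_cap) {
        return PCI_D0;
    }
    return (enum pci_power_state)(cfg_read16(config, pm_cap + PCI_PM_CTRL) &
                                  PCI_PM_CTRL_STATE_MASK);
}

/* Assumed policy: memory BARs belong in the AddressSpace only when Memory
 * Space Enable is set AND the device is in D0 (any other state counts as
 * "low power" here). */
static bool mem_bars_visible(const uint8_t *config, uint8_t pm_cap)
{
    bool mem_enabled = cfg_read16(config, PCI_COMMAND) & PCI_COMMAND_MEMORY;
    return mem_enabled && pm_state(config, pm_cap) == PCI_D0;
}
```

For example, with Memory Space Enable set and a PM capability at a hypothetical offset 0x40, mem_bars_visible() returns true while PowerState is 0 (D0) and false once the guest writes 3 (D3hot) to PMCSR.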

I took a stab at this approach here. It doesn't obviously break
anything in my configs, but I haven't yet tried to reproduce this exact
scenario.

https://gitlab.com/alex.williamson/qemu/-/tree/pci-pm-power-state
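To illustrate the remove-on-suspend, re-add-on-wake behavior proposed earlier in the thread, here is a minimal C sketch. It is not taken from the branch linked above. Every write that can change COMMAND or PMCSR re-evaluates one combined condition, and BARs are added or removed only when that condition flips. SketchDev, write_command(), write_pmcsr(), and the counters standing in for real memory-region and DMA (un)map work are all hypothetical. Treating any state other than D0 as "low power" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CMD_MEMORY        0x2  /* PCI_COMMAND Memory Space Enable */
#define PMCSR_STATE_MASK  0x3  /* PMCSR PowerState field, bits 1:0 */
#define STATE_D0          0

/* Hypothetical per-device state; names are illustrative, not QEMU's API. */
typedef struct SketchDev {
    uint16_t command;      /* shadow of the Command register */
    uint8_t  power_state;  /* PMCSR PowerState: 0 = D0 ... 3 = D3hot */
    bool     bars_mapped;  /* are the BARs currently in the AddressSpace? */
    int      map_calls;    /* stand-ins for real add/DMA-map work */
    int      unmap_calls;  /* stand-ins for real remove/DMA-unmap work */
} SketchDev;

/* Assumed policy: memory enabled AND device in D0. */
static bool bars_should_be_mapped(const SketchDev *d)
{
    return (d->command & CMD_MEMORY) && d->power_state == STATE_D0;
}

/* Called after every write that may change COMMAND or PMCSR. Only a change
 * in the combined condition adds or removes the BARs, so the mappings stay
 * consistent in both directions: suspend removes them, wake re-adds them. */
static void update_bar_mappings(SketchDev *d)
{
    bool want = bars_should_be_mapped(d);

    if (want == d->bars_mapped) {
        return;
    }
    if (want) {
        d->map_calls++;    /* e.g. enable BAR subregions -> DMA map */
    } else {
        d->unmap_calls++;  /* e.g. disable BAR subregions -> DMA unmap */
    }
    d->bars_mapped = want;
    assert(d->bars_mapped == bars_should_be_mapped(d));
}

static void write_command(SketchDev *d, uint16_t val)
{
    d->command = val;
    update_bar_mappings(d);
}

static void write_pmcsr(SketchDev *d, uint16_t val)
{
    d->power_state = (uint8_t)(val & PMCSR_STATE_MASK);
    update_bar_mappings(d);
}
```

Enabling memory, then entering D3hot, then returning to D0 yields one map, one unmap, and a second map. That second map is the remap on wake that a D3hot-only skip of the DMA map would not provide.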

There's another pm_cap on the PCIExpressDevice that needs to be
consolidated as well, once I do some research to figure out why a
non-express capability is tracked only by express devices and what
they're doing with it.  Thanks,

Alex
