On Fri, 24 Feb 2023 12:53:26 +0000 Joao Martins <joao.m.mart...@oracle.com> wrote:
> On 24/02/2023 11:25, Joao Martins wrote:
> > On 23/02/2023 23:26, Jason Gunthorpe wrote:
> >> On Thu, Feb 23, 2023 at 03:33:09PM -0700, Alex Williamson wrote:
> >>> On Thu, 23 Feb 2023 16:55:54 -0400
> >>> Jason Gunthorpe <j...@nvidia.com> wrote:
> >>>> On Thu, Feb 23, 2023 at 01:06:33PM -0700, Alex Williamson wrote:
> >>>> Or even better figure out how to get interrupt remapping without IOMMU
> >>>> support :\
> >>>
> >>> -machine q35,default_bus_bypass_iommu=on,kernel-irqchip=split \
> >>> -device intel-iommu,caching-mode=on,intremap=on
> >>
> >> Joao?
> >>
> >> If this works lets just block migration if the vIOMMU is turned on..
> >
> > At a first glance, this looked like my regular iommu incantation.
> >
> > But reading the code this ::bypass_iommu (new to me) apparently tells that
> > vIOMMU is bypassed or not for the PCI devices all the way to avoiding
> > enumerating in the IVRS/DMAR ACPI tables. And I see VFIO double-checks
> > whether PCI device is within the IOMMU address space (or bypassed) prior
> > to DMA maps and such.
> >
> > You can see from the other email that all of the other options in my head
> > were either bit inconvenient or risky. I wasn't aware of this option for
> > what is worth -- much simpler, should work!
>
> I say *should*, but on a second thought interrupt remapping may still be
> required to one of these devices that are IOMMU-bypassed. Say to put
> affinities to vcpus above 255? I was trying this out with more than 255
> vcpus with a couple VFs and at a first glance these VFs fail to probe
> (these are CX6 VFs).
>
> It is a working setup without the parameter, but now adding a
> default_bus_bypass_iommu=on fails to init VFs:
>
> [ 32.412733] mlx5_core 0000:00:02.0: Rate limit: 127 rates are supported,
> range: 0Mbps to 97656Mbps
> [ 32.416242] mlx5_core 0000:00:02.0: mlx5_load:1204:(pid 3361): Failed to
> alloc IRQs
> [ 33.227852] mlx5_core 0000:00:02.0: probe_one:1684:(pid 3361): mlx5_init_one
> failed with error code -19
> [ 33.242182] mlx5_core 0000:00:03.0: firmware version: 22.31.1660
> [ 33.415876] mlx5_core 0000:00:03.0: Rate limit: 127 rates are supported,
> range: 0Mbps to 97656Mbps
> [ 33.448016] mlx5_core 0000:00:03.0: mlx5_load:1204:(pid 3361): Failed to
> alloc IRQs
> [ 34.207532] mlx5_core 0000:00:03.0: probe_one:1684:(pid 3361): mlx5_init_one
> failed with error code -19
>
> I haven't dived yet into why it fails.

Hmm, I was thinking this would only affect DMA, but on second thought I
think the DRHD also describes the interrupt remapping hardware, and while
interrupt remapping is an optional feature of the DRHD, DMA remapping is
always supported afaict.  I saw IR vectors in /proc/interrupts and thought
it worked, but indeed an assigned device is having trouble getting vectors.

> > And avoiding vIOMMU simplifies the whole patchset too, if it's OK to add
> > a live migration blocker if `bypass_iommu` is off for any PCI device.
>
> Still we could have for starters a live migration blocker until we revisit
> the vIOMMU case ... should we deem that the default_bus_bypass_iommu=on or
> the others I suggested as non-options?

I'm very uncomfortable presuming a vIOMMU usage model, especially when it
leads to potentially untracked DMA if our assumptions are violated.
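
For concreteness, the per-device blocker Joao mentions could look roughly
like the minimal sketch below, built on QEMU's existing
migrate_add_blocker() and pci_device_iommu_address_space().  The helper
name is made up purely for illustration and the exact blocker API differs
between releases, so treat this as a sketch of the idea rather than a
proposal:

    /*
     * Sketch only, not a tested patch: register a migration blocker when
     * the assigned device is translated by a vIOMMU, i.e. not covered by
     * bypass_iommu.  vfio_viommu_migration_blocker() is an illustrative
     * name, not an existing function.
     */
    #include "qemu/osdep.h"
    #include "qapi/error.h"
    #include "migration/blocker.h"
    #include "hw/pci/pci.h"
    #include "exec/address-spaces.h"

    static Error *viommu_migration_blocker;

    static int vfio_viommu_migration_blocker(PCIDevice *pdev, Error **errp)
    {
        /*
         * With bypass_iommu=on the device's DMA address space is plain
         * system memory; anything else means a vIOMMU translates it.
         */
        if (pci_device_iommu_address_space(pdev) == &address_space_memory) {
            return 0;   /* bypassed: no blocker needed */
        }

        error_setg(&viommu_migration_blocker,
                   "VFIO migration is not supported when the device is "
                   "translated by a vIOMMU");
        return migrate_add_blocker(viommu_migration_blocker, errp);
    }

The only assumption here is the one already quoted above: bypass_iommu
leaves the device's DMA address space as plain system memory, which is the
same property VFIO checks before doing DMA maps.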
We could use a MemoryListener on the IOVA space to record a high water
mark (rough sketch below), but we'd need to keep monitoring that mark
while we're in pre-copy, and I don't think anyone would consider it
supportable for a migratable VM to suddenly become unmigratable due to a
random IOVA allocation.

That leads me to think that a machine option to limit the vIOMMU address
space, and testing that limit against the device before declaring
migration support for the device, is possibly our best option.  Is that
feasible?  Do all the vIOMMU models have a means to limit the IOVA space?
How does QEMU learn the limit for a given device?

We probably also need to think about whether there are devices that can
even support the guest physical memory ranges once we start relocating RAM
to arbitrary addresses (e.g. HyperTransport).  Can we infer anything from
the vCPU virtual address space, or is that still an unreasonable range for
devices to track?  Thanks,

Alex
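
The high water mark listener mentioned above, as a rough sketch only,
assuming the MemoryListener interface the way hw/vfio/common.c already
uses it; the names iova_hwm_listener and max_iova are made up for
illustration:

    /*
     * Sketch of a high water mark tracker on a device's DMA address
     * space.  Not proposed code.  Note that with a vIOMMU the actual
     * mappings arrive via IOMMU notifiers hooked from region_add (as
     * vfio_listener_region_add() does); this only shows the mark-keeping
     * part of the idea.
     */
    #include "qemu/osdep.h"
    #include "exec/memory.h"

    static hwaddr max_iova;   /* highest mapped IOVA end seen so far */

    static void iova_hwm_region_add(MemoryListener *listener,
                                    MemoryRegionSection *section)
    {
        hwaddr end = section->offset_within_address_space +
                     int128_get64(section->size);

        if (end > max_iova) {
            max_iova = end;
            /*
             * The problem described above: if the mark now exceeds what
             * the device can dirty-track, an already-migratable VM would
             * become unmigratable mid-flight.
             */
        }
    }

    static MemoryListener iova_hwm_listener = {
        .name       = "iova-high-water-mark",
        .region_add = iova_hwm_region_add,
    };

    /*
     * Would be registered against the device's DMA address space, e.g.:
     *     memory_listener_register(&iova_hwm_listener,
     *                              pci_device_iommu_address_space(pdev));
     */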