On Fri, 24 Feb 2023 12:53:26 +0000 Joao Martins <joao.m.mart...@oracle.com> wrote:
> On 24/02/2023 11:25, Joao Martins wrote:
> > On 23/02/2023 23:26, Jason Gunthorpe wrote:
> >> On Thu, Feb 23, 2023 at 03:33:09PM -0700, Alex Williamson wrote:
> >>> On Thu, 23 Feb 2023 16:55:54 -0400
> >>> Jason Gunthorpe <j...@nvidia.com> wrote:
> >>>> On Thu, Feb 23, 2023 at 01:06:33PM -0700, Alex Williamson wrote:
> >>>> Or even better figure out how to get interrupt remapping without IOMMU
> >>>> support :\
> >>>
> >>> -machine q35,default_bus_bypass_iommu=on,kernel-irqchip=split \
> >>> -device intel-iommu,caching-mode=on,intremap=on
> >>
> >> Joao?
> >>
> >> If this works lets just block migration if the vIOMMU is turned on..
> >
> > At a first glance, this looked like my regular iommu incantation.
> >
> > But reading the code this ::bypass_iommu (new to me) apparently tells that
> > vIOMMU is bypassed or not for the PCI devices all the way to avoiding
> > enumerating in the IVRS/DMAR ACPI tables. And I see VFIO double-checks
> > whether PCI device is within the IOMMU address space (or bypassed) prior
> > to DMA maps and such.
> >
> > You can see from the other email that all of the other options in my head
> > were either bit inconvenient or risky. I wasn't aware of this option for
> > what is worth -- much simpler, should work!
>
> I say *should*, but on a second thought interrupt remapping may still be
> required to one of these devices that are IOMMU-bypassed. Say to put
> affinities to vcpus above 255? I was trying this out with more than 255
> vcpus with a couple VFs and at a first glance these VFs fail to probe
> (these are CX6 VFs).
>
> It is a working setup without the parameter, but now adding a
> default_bus_bypass_iommu=on fails to init VFs:
>
> [ 32.412733] mlx5_core 0000:00:02.0: Rate limit: 127 rates are supported,
> range: 0Mbps to 97656Mbps
> [ 32.416242] mlx5_core 0000:00:02.0: mlx5_load:1204:(pid 3361): Failed to
> alloc IRQs
> [ 33.227852] mlx5_core 0000:00:02.0: probe_one:1684:(pid 3361): mlx5_init_one
> failed with error code -19
> [ 33.242182] mlx5_core 0000:00:03.0: firmware version: 22.31.1660
> [ 33.415876] mlx5_core 0000:00:03.0: Rate limit: 127 rates are supported,
> range: 0Mbps to 97656Mbps
> [ 33.448016] mlx5_core 0000:00:03.0: mlx5_load:1204:(pid 3361): Failed to
> alloc IRQs
> [ 34.207532] mlx5_core 0000:00:03.0: probe_one:1684:(pid 3361): mlx5_init_one
> failed with error code -19
>
> I haven't dived yet into why it fails.

Hmm, I was thinking this would only affect DMA, but on second thought I
think the DRHD also describes the interrupt remapping hardware, and while
interrupt remapping is an optional feature of the DRHD, DMA remapping is
always supported afaict.  I saw IR vectors in /proc/interrupts and thought
it worked, but indeed an assigned device is having trouble getting vectors.

> > And avoiding vIOMMU simplifies the whole patchset too, if it's OK to add
> > a live migration blocker if `bypass_iommu` is off for any PCI device.
>
> Still we could have for starters a live migration blocker until we revisit
> the vIOMMU case ... should we deem that the default_bus_bypass_iommu=on or
> the others I suggested as non-options?

I'm very uncomfortable presuming a vIOMMU usage model, especially when it
leads to potentially untracked DMA if our assumptions are violated.
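
For concreteness, the per-device blocker Joao mentions could look roughly
like the minimal sketch below, built on QEMU's existing
migrate_add_blocker() and pci_device_iommu_address_space().  The helper
name is made up purely for illustration and the exact blocker API differs
between releases, so treat this as a sketch of the idea rather than a
proposal:

    /*
     * Sketch only, not a tested patch: register a migration blocker when
     * the assigned device is translated by a vIOMMU, i.e. not covered by
     * bypass_iommu.  vfio_viommu_migration_blocker() is an illustrative
     * name, not an existing function.
     */
    #include "qemu/osdep.h"
    #include "qapi/error.h"
    #include "migration/blocker.h"
    #include "hw/pci/pci.h"
    #include "exec/address-spaces.h"

    static Error *viommu_migration_blocker;

    static int vfio_viommu_migration_blocker(PCIDevice *pdev, Error **errp)
    {
        /*
         * With bypass_iommu=on the device's DMA address space is plain
         * system memory; anything else means a vIOMMU translates it.
         */
        if (pci_device_iommu_address_space(pdev) == &address_space_memory) {
            return 0;   /* bypassed: no blocker needed */
        }

        error_setg(&viommu_migration_blocker,
                   "VFIO migration is not supported when the device is "
                   "translated by a vIOMMU");
        return migrate_add_blocker(viommu_migration_blocker, errp);
    }

The only assumption here is the one already quoted above: bypass_iommu
leaves the device's DMA address space as plain system memory, which is the
same property VFIO checks before doing DMA maps.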
We could use a MemoryListener on the IOVA space to record a high water
mark (rough sketch below), but we'd need to keep monitoring that mark
while we're in pre-copy, and I don't think anyone would consider it
supportable for a migratable VM to suddenly become unmigratable due to a
random IOVA allocation.

That leads me to think that a machine option to limit the vIOMMU address
space, and testing that limit against the device before declaring
migration support for the device, is possibly our best option.  Is that
feasible?  Do all the vIOMMU models have a means to limit the IOVA space?
How does QEMU learn the limit for a given device?

We probably also need to think about whether there are devices that can
even support the guest physical memory ranges once we start relocating RAM
to arbitrary addresses (e.g. HyperTransport).  Can we infer anything from
the vCPU virtual address space, or is that still an unreasonable range for
devices to track?  Thanks,

Alex
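
The high water mark listener mentioned above, as a rough sketch only,
assuming the MemoryListener interface the way hw/vfio/common.c already
uses it; the names iova_hwm_listener and max_iova are made up for
illustration:

    /*
     * Sketch of a high water mark tracker on a device's DMA address
     * space.  Not proposed code.  Note that with a vIOMMU the actual
     * mappings arrive via IOMMU notifiers hooked from region_add (as
     * vfio_listener_region_add() does); this only shows the mark-keeping
     * part of the idea.
     */
    #include "qemu/osdep.h"
    #include "exec/memory.h"

    static hwaddr max_iova;   /* highest mapped IOVA end seen so far */

    static void iova_hwm_region_add(MemoryListener *listener,
                                    MemoryRegionSection *section)
    {
        hwaddr end = section->offset_within_address_space +
                     int128_get64(section->size);

        if (end > max_iova) {
            max_iova = end;
            /*
             * The problem described above: if the mark now exceeds what
             * the device can dirty-track, an already-migratable VM would
             * become unmigratable mid-flight.
             */
        }
    }

    static MemoryListener iova_hwm_listener = {
        .name       = "iova-high-water-mark",
        .region_add = iova_hwm_region_add,
    };

    /*
     * Would be registered against the device's DMA address space, e.g.:
     *     memory_listener_register(&iova_hwm_listener,
     *                              pci_device_iommu_address_space(pdev));
     */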