On Thu, May 14, 2026 at 06:13:24PM +0000, Michael Kelley wrote: > From: Yu Zhang <[email protected]> Sent: Monday, May 11, 2026 9:24 > AM > > > > Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V. > > This driver implements stage-1 IO translation within the guest OS. > > It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls > > for: > > - Capability discovery > > - Domain allocation, configuration, and deallocation > > - Device attachment and detachment > > - IOTLB invalidation > > > > The driver constructs x86-compatible stage-1 IO page tables in the > > guest memory using consolidated IO page table helpers. This allows > > the guest to manage stage-1 translations independently of vendor- > > specific drivers (like Intel VT-d or AMD IOMMU). > > > > Hyper-V consumes this stage-1 IO page table when a device domain is > > created and configured, and nests it with the host's stage-2 IO page > > tables, therefore eliminating the VM exits for guest IOMMU mapping > > operations. For unmapping operations, VM exits to perform the IOTLB > > flush are still unavoidable. > > > > Hyper-V identifies each PCI pass-thru device by a logical device ID > > in its hypercall interface. The vPCI driver (pci-hyperv) registers the > > per-bus portion of this ID with the pvIOMMU driver during bus probe. > > The pvIOMMU driver stores this mapping and combines it with the function > > number of the endpoint PCI device to form the complete ID for hypercalls. > > As you are probably aware, Mukesh's patch series to support PCI > pass-thru devices also needs to get the logical device ID. Maybe the > registration mechanism needs to move somewhere that can be shared > with his code. >
Thank you so much for the review, Michael! Yes, I looked at Mukesh's series and noticed his hv_pci_vmbus_device_id() in pci-hyperv.c has the same dev_instance byte manipulation. We do need a common registration mechanism. Any suggestion on where to put it? drivers/hv/hv_common.c seems like a natural place, but the register/lookup functions are currently only meaningful when CONFIG_HYPERV_PVIOMMU is set. If Mukesh's pass-thru code also needs them, we might need a new shared Kconfig option that both can select. Open to better ideas. Adding Mukesh to the loop. :) > > > > Co-developed-by: Wei Liu <[email protected]> > > Signed-off-by: Wei Liu <[email protected]> > > Co-developed-by: Easwar Hariharan <[email protected]> > > Signed-off-by: Easwar Hariharan <[email protected]> > > Signed-off-by: Yu Zhang <[email protected]> > > --- > > arch/x86/hyperv/hv_init.c | 4 + > > arch/x86/include/asm/mshyperv.h | 4 + > > drivers/iommu/hyperv/Kconfig | 17 + > > drivers/iommu/hyperv/Makefile | 1 + > > drivers/iommu/hyperv/iommu.c | 705 ++++++++++++++++++++++++++++ > > drivers/iommu/hyperv/iommu.h | 54 +++ > > drivers/pci/controller/pci-hyperv.c | 19 +- > > include/asm-generic/mshyperv.h | 12 + > > 8 files changed, 815 insertions(+), 1 deletion(-) > > create mode 100644 drivers/iommu/hyperv/iommu.c > > create mode 100644 drivers/iommu/hyperv/iommu.h > > > > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c > > index 323adc93f2dc..2c8ff8e06249 100644 > > --- a/arch/x86/hyperv/hv_init.c > > +++ b/arch/x86/hyperv/hv_init.c > > @@ -578,6 +578,10 @@ void __init hyperv_init(void) > > old_setup_percpu_clockev = x86_init.timers.setup_percpu_clockev; > > x86_init.timers.setup_percpu_clockev = hv_stimer_setup_percpu_clockev; > > > > +#ifdef CONFIG_HYPERV_PVIOMMU > > + x86_init.iommu.iommu_init = hv_iommu_init; > > +#endif > > + > > hv_apic_init(); > > > > x86_init.pci.arch_init = hv_pci_init; > > diff --git a/arch/x86/include/asm/mshyperv.h > > b/arch/x86/include/asm/mshyperv.h > > index f64393e853ee..20d947c2c758 100644 > > --- a/arch/x86/include/asm/mshyperv.h > > +++ b/arch/x86/include/asm/mshyperv.h > > @@ -313,6 +313,10 @@ static inline void mshv_vtl_return_hypercall(void) {} > > static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context > > *vtl0) {} > > #endif > > > > +#ifdef CONFIG_HYPERV_PVIOMMU > > +int __init hv_iommu_init(void); > > +#endif > > + > > #include <asm-generic/mshyperv.h> > > > > #endif > > diff --git a/drivers/iommu/hyperv/Kconfig b/drivers/iommu/hyperv/Kconfig > > index 30f40d867036..9e658d5c9a77 100644 > > --- a/drivers/iommu/hyperv/Kconfig > > +++ b/drivers/iommu/hyperv/Kconfig > > @@ -8,3 +8,20 @@ config HYPERV_IOMMU > > help > > Stub IOMMU driver to handle IRQs to support Hyper-V Linux > > guest and root partitions. > > + > > +if HYPERV_IOMMU > > +config HYPERV_PVIOMMU > > + bool "Microsoft Hypervisor para-virtualized IOMMU support" > > + depends on X86 && HYPERV > > What is the intent w.r.t. 32-bit builds? Using X86 instead of X86_64 > allows it. I did a 32-bit build and didn't get any build failures, which is > good. But I can't run it to see if the pvIOMMU actually works in a > 32-bit build. I don't know how building X86_64 generic PT entries > would fare. > Sorry, no intent to support 32-bit. I'll change to `depends on X86_64 && HYPERV` [...] > > +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain) > > +{ > > + u64 status; > > + unsigned long flags; > > + struct hv_input_flush_device_domain *input; > > + > > + local_irq_save(flags); > > + > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > > + memset(input, 0, sizeof(*input)); > > + input->device_domain = hv_domain->device_domain; > > The previous version of this patch had code to set several other fields in > the input. I wanted to confirm that not setting them in this version is > intentional. Were they not needed? > Oh. The RFC v1 set partition_id, owner_vtl, domain_id.type, and domain_id.id individually. In this version, I just simplified it to a struct assignment. No functional change. > > + status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN, input, NULL); > > + > > + local_irq_restore(flags); > > + > > + if (!hv_result_success(status)) > > + pr_err("HVCALL_FLUSH_DEVICE_DOMAIN failed, status %lld\n", > > status); > > +} > > + > > +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device > > *dev) > > +{ > > + u64 status; > > + unsigned long flags; > > + struct hv_input_detach_device_domain *input; > > + struct pci_dev *pdev; > > + struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain); > > + struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev); > > + > > + /* See the attach function, only PCI devices for now */ > > + if (!dev_is_pci(dev) || vdev->hv_domain != hv_domain) > > + return; > > Are these sanity checks necessary? The only caller is hv_iommu_attach_dev() > and it has already done the checks. You're right, they're redundant. > > + > > + pdev = to_pci_dev(dev); > > + > > + dev_dbg(dev, "detaching from domain %d\n", > > hv_domain->device_domain.domain_id.id); > > + > > + local_irq_save(flags); > > + > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > > + memset(input, 0, sizeof(*input)); > > + input->partition_id = HV_PARTITION_ID_SELF; > > + if (hv_iommu_lookup_logical_dev_id(pdev, &input->device_id.as_uint64)) { > > As Sashiko and Jacob Pan pointed out, doing the lookup while interrupts are > disabled > is problematic. My suggestion would be to just do the lookup into a local > variable > before disabling interrupts (rather than using a raw spin lock as Jacob > suggested). > > Same situation occurs in hv_iommu_attach_dev() and > hv_iommu_get_logical_device_property(). > Thanks! I would also prefer to look up before disabling interrupts. > > + local_irq_restore(flags); > > + dev_warn(&pdev->dev, "no IOMMU registration for vPCI bus on > > detach\n"); > > + return; > > + } > > + status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL); > > + > > + local_irq_restore(flags); > > + > > + if (!hv_result_success(status)) > > + pr_err("HVCALL_DETACH_DEVICE_DOMAIN failed, status %lld\n", > > status); > > + > > + hv_flush_device_domain(hv_domain); > > + > > + vdev->hv_domain = NULL; > > +} > > + [...] > > +static int hv_iommu_get_logical_device_property(struct device *dev, > > + u32 code, > > + struct > > hv_output_get_logical_device_property *property) > > +{ > > + u64 status, lid; > > + unsigned long flags; > > + int ret; > > + struct hv_input_get_logical_device_property *input; > > + struct hv_output_get_logical_device_property *output; > > + > > + local_irq_save(flags); > > + > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > > + output = *this_cpu_ptr(hyperv_pcpu_input_arg) + sizeof(*input); > > Nit: The other way to set output is: > > output = input + 1; > > I think this produces slightly better code because of not needing to > reference the per-cpu variable hyperv_pcpu_input_arg a 2nd time. > > Indeed! It's more elegant. :) > > + memset(input, 0, sizeof(*input)); > > + input->partition_id = HV_PARTITION_ID_SELF; > > + ret = hv_iommu_lookup_logical_dev_id(to_pci_dev(dev), &lid); > > + if (ret) { > > + local_irq_restore(flags); > > + return ret; > > + } > > + input->logical_device_id = lid; > > + input->code = code; > > + status = hv_do_hypercall(HVCALL_GET_LOGICAL_DEVICE_PROPERTY, input, > > output); > > + *property = *output; > > + > > + local_irq_restore(flags); > > + > > + if (!hv_result_success(status)) > > + pr_err("HVCALL_GET_LOGICAL_DEVICE_PROPERTY failed, status > > %lld\n", status); > > + > > + return hv_result_to_errno(status); > > +} > > + [...] > > +static void hv_iommu_release_device(struct device *dev) > > +{ > > + struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev); > > + struct pci_dev *pdev = to_pci_dev(dev); > > + > > + if (pdev->ats_enabled) > > + pci_disable_ats(pdev); > > + > > + dev_iommu_priv_set(dev, NULL); > > + set_dma_ops(dev, NULL); > > Previous versions of this function did hv_iommu_detach_dev(). With that call > removed from here, hv_iommu_detach_dev() is only called when attaching a > domain to a device that already has a domain attached. Is it the case that > Hyper-V doesn't require the detach as a cleanup step? > The IOMMU core attaches the device to release_domain (our blocking domain) before calling release_device(), so I believe the explicit detach in the RFC was redundant. I simply didn't realize that at the time. Sorry I forgot to mention this in the changelog. [...] > > +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device > > *dev) > > +{ > > + int ret; > > + struct hv_iommu_domain *hv_domain; > > + struct pt_iommu_x86_64_cfg cfg = {}; > > + > > + hv_domain = kzalloc_obj(*hv_domain, GFP_KERNEL); > > + if (!hv_domain) > > + return ERR_PTR(-ENOMEM); > > + > > + ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1); > > + if (ret) { > > + kfree(hv_domain); > > + return ERR_PTR(ret); > > + } > > + > > + hv_domain->domain.geometry = hv_iommu_device->geometry; > > + hv_domain->pt_iommu.nid = dev_to_node(dev); > > + > > + cfg.common.hw_max_vasz_lg2 = hv_iommu_device->max_iova_width; > > + cfg.common.hw_max_oasz_lg2 = 52; > > + cfg.top_level = (hv_iommu_device->max_iova_width > 48) ? 4 : 3; > > + > > + ret = pt_iommu_x86_64_init(&hv_domain->pt_iommu_x86_64, &cfg, > > GFP_KERNEL); > > + if (ret) { > > + hv_delete_device_domain(hv_domain); > > + kfree(hv_domain); > > + return ERR_PTR(ret); > > + } > > + > > + /* Constrain to page sizes the hypervisor supports */ > > + hv_domain->domain.pgsize_bitmap &= hv_iommu_device->pgsize_bitmap; > > + > > + hv_domain->domain.ops = &hv_iommu_paging_domain_ops; > > + > > + ret = hv_configure_device_domain(hv_domain, __IOMMU_DOMAIN_PAGING); > > + if (ret) { > > + pt_iommu_deinit(&hv_domain->pt_iommu); > > + hv_delete_device_domain(hv_domain); > > + kfree(hv_domain); > > + return ERR_PTR(ret); > > + } > > + > > + return &hv_domain->domain; > > I think this function would be better if the error paths did "goto" > a cascading set of error labels. That's the typical pattern, and it's what you > use in hv_iommu_init(), for example. > Good point. Will restructure to use goto-based error labels > > +} > > + > > +static struct iommu_ops hv_iommu_ops = { > > + .capable = hv_iommu_capable, > > + .domain_alloc_paging = hv_iommu_domain_alloc_paging, > > + .probe_device = hv_iommu_probe_device, > > + .release_device = hv_iommu_release_device, > > + .device_group = hv_iommu_device_group, > > + .get_resv_regions = hv_iommu_get_resv_regions, > > + .owner = THIS_MODULE, > > + .identity_domain = &hv_identity_domain.domain, > > + .blocked_domain = &hv_blocking_domain.domain, > > + .release_domain = &hv_blocking_domain.domain, > > +}; > > + > > +static int hv_iommu_detect(struct hv_output_get_iommu_capabilities > > *hv_iommu_cap) > > +{ > > + u64 status; > > + unsigned long flags; > > + struct hv_input_get_iommu_capabilities *input; > > + struct hv_output_get_iommu_capabilities *output; > > + > > + local_irq_save(flags); > > + > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > > + output = *this_cpu_ptr(hyperv_pcpu_input_arg) + sizeof(*input); > > Potentially use "output = input + 1" here as well. > Yes. Thanks! [...] > > @@ -3857,13 +3858,25 @@ static int hv_pci_probe(struct hv_device *hdev, > > > > hbus->state = hv_pcibus_probed; > > > > - ret = create_root_hv_pci_bus(hbus); > > + /* Notify pvIOMMU before any device on the bus is scanned. */ > > + prefix = (hdev->dev_instance.b[5] << 24) | > > + (hdev->dev_instance.b[4] << 16) | > > + (hdev->dev_instance.b[7] << 8) | > > + (hdev->dev_instance.b[6] & 0xf8); > > This assembling of the logical device id prefix duplicates the > code in hv_irq_retarget_interrupt(). Could this code save the > prefix in struct hv_pcibus_device, and then have > hv_irq_retarget_interrupt() use it? Then it would be clear > that HVCALL_RETARGET_INTERRUPT is using exactly the same > logical device id as the IOMMU hypercalls. > Good point. I think we can do it. :) > > + > > + ret = hv_iommu_register_pci_bus(dom, prefix); > > if (ret) > > goto free_windows; > > > > + ret = create_root_hv_pci_bus(hbus); > > + if (ret) > > + goto unregister_pviommu; > > + > > mutex_unlock(&hbus->state_lock); > > return 0; > > > > +unregister_pviommu: > > + hv_iommu_unregister_pci_bus(dom); > > free_windows: > > hv_pci_free_bridge_windows(hbus); > > exit_d0: > > @@ -3974,8 +3987,10 @@ static int hv_pci_bus_exit(struct hv_device *hdev, > > bool > > keep_devs) > > static void hv_pci_remove(struct hv_device *hdev) > > { > > struct hv_pcibus_device *hbus; > > + int dom; > > > > hbus = hv_get_drvdata(hdev); > > + dom = hbus->bridge->domain_nr; > > Nit: Setting "dom" here feels a little weird because the value is only needed > under the "if" statement. The value must be read before the root bus is > removed, but even so moving it under the "if" statement would make more > sense to me. > Sure. Thanks! > > if (hbus->state == hv_pcibus_installed) { > > tasklet_disable(&hdev->channel->callback_event); > > hbus->state = hv_pcibus_removing; > > @@ -3994,6 +4009,8 @@ static void hv_pci_remove(struct hv_device *hdev) > > hv_pci_remove_slots(hbus); > > pci_remove_root_bus(hbus->bridge->bus); > > pci_unlock_rescan_remove(); > > + > > + hv_iommu_unregister_pci_bus(dom); > > } > > > > hv_pci_bus_exit(hdev, false); B.R. Yu

