On Fri, May 15, 2026 at 07:31:39PM -0300, Jason Gunthorpe wrote:
> On Tue, May 12, 2026 at 12:24:07AM +0800, Yu Zhang wrote:
> > +/*
> > + * Identity and blocking domains are static singletons: identity is a 1:1
> > + * passthrough with no page table, blocking rejects all DMA. Neither holds
> > + * per-IOMMU state, so one instance suffices even with multiple vIOMMUs.
> > + */
> > +static struct hv_iommu_domain hv_identity_domain;
> > +static struct hv_iommu_domain hv_blocking_domain;
> > +static const struct iommu_domain_ops hv_iommu_identity_domain_ops;
> > +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops;
>
> Please follow the format of other drivers and statically initialize
> these here instead of in C code.
>
Thanks! Will do.
> > +static struct iommu_ops hv_iommu_ops;
> > +static LIST_HEAD(hv_iommu_pci_bus_list);
> > +static DEFINE_SPINLOCK(hv_iommu_pci_bus_lock);
> > +
> > +#define hv_iommu_present(iommu_cap) (iommu_cap & HV_IOMMU_CAP_PRESENT)
> > +#define hv_iommu_s1_domain_supported(iommu_cap) (iommu_cap &
> > HV_IOMMU_CAP_S1)
> > +#define hv_iommu_5lvl_supported(iommu_cap) (iommu_cap &
> > HV_IOMMU_CAP_S1_5LVL)
> > +#define hv_iommu_ats_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_ATS)
>
> prefer to see static inlines
>
Yes. Thanks.
> > +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device
> > *dev)
> > +{
> > + u64 status;
> > + unsigned long flags;
> > + struct hv_input_detach_device_domain *input;
> > + struct pci_dev *pdev;
> > + struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
> > + struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> > +
> > + /* See the attach function, only PCI devices for now */
> > + if (!dev_is_pci(dev) || vdev->hv_domain != hv_domain)
> > + return;
> > +
> > + pdev = to_pci_dev(dev);
> > +
> > + dev_dbg(dev, "detaching from domain %d\n",
> > hv_domain->device_domain.domain_id.id);
> > +
> > + local_irq_save(flags);
> > +
> > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > + memset(input, 0, sizeof(*input));
> > + input->partition_id = HV_PARTITION_ID_SELF;
> > + if (hv_iommu_lookup_logical_dev_id(pdev, &input->device_id.as_uint64)) {
> > + local_irq_restore(flags);
> > + dev_warn(&pdev->dev, "no IOMMU registration for vPCI bus on
> > detach\n");
> > + return;
> > + }
> > + status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
>
> FWIW the hypervisor cannot implement the linux attach semantics if it
> has detach/attach. It must support simply 'attach' which atomically
> changes the translation. With detach you have a confusing problem if
> errors happen in the middle of the sequence the device is left in an
> unclear state. You should at least document what state the hypervisor
> is supposed to leaave the translation iin during these failures..
>
Thanks for raising this!
My understanding is that the detach hypercall just clears the HW
IOMMU entry for the device (e.g., context table entry on VT-d, DTE
on AMD IOMMU) and flushes the IOTLB. So after detach and before the
subsequent attach, DMA from the device is blocked by the HW IOMMU
- the device is not in an unprotected state.
The detach shall not fail unless the pvIOMMU driver passes incorrect
parameters or is otherwise in a buggy state, or Hyper-V itself has
a bug.
But making attach atomic might be a cleaner approach (drop the
explicit detach and let the hypervisor handle the domain switch
internally)? I'll look into making that work, though it needs
verification (and possibly changes) on the hypervisor side (these
hypercalls are not exclusively for Linux guest). If that doesn't
work out, I'll keep the two-step approach with comments documenting
the intermediate translation state.
> > +static void hv_iommu_release_device(struct device *dev)
> > +{
> > + struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> > + struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > + if (pdev->ats_enabled)
> > + pci_disable_ats(pdev);
> > +
> > + dev_iommu_priv_set(dev, NULL);
> > + set_dma_ops(dev, NULL);
>
> Does the driver really need to mess with set_dma_ops ? I'd rather not
> see that in new iommu drivers if at all possible :|
>
No. I should have removed it. Thanks!
> > +static int __init hv_initialize_static_domains(void)
> > +{
> > + int ret;
> > + struct hv_iommu_domain *hv_domain;
> > +
> > + /* Default stage-1 identity domain */
> > + hv_domain = &hv_identity_domain;
> > +
> > + ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> > + if (ret)
> > + return ret;
> > +
> > + ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_IDENTITY);
> > + if (ret)
> > + goto delete_identity_domain;
> > +
> > + hv_domain->domain.type = IOMMU_DOMAIN_IDENTITY;
> > + hv_domain->domain.ops = &hv_iommu_identity_domain_ops;
> > + hv_domain->domain.owner = &hv_iommu_ops;
> > + hv_domain->domain.geometry = hv_iommu_device->geometry;
> > + hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;
>
> identity doesn't use geometry or pgsize_bitmap
>
Sure. Will remove it.
> > + /* Default stage-1 blocked domain */
> > + hv_domain = &hv_blocking_domain;
> > +
> > + ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> > + if (ret)
> > + goto delete_identity_domain;
> > +
> > + ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_BLOCKED);
> > + if (ret)
> > + goto delete_blocked_domain;
> > +
> > + hv_domain->domain.type = IOMMU_DOMAIN_BLOCKED;
> > + hv_domain->domain.ops = &hv_iommu_blocking_domain_ops;
> > + hv_domain->domain.owner = &hv_iommu_ops;
> > + hv_domain->domain.geometry = hv_iommu_device->geometry;
> > + hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;
>
> Nor does blocked
>
Same. Will remove.
> > +#define INTERRUPT_RANGE_START (0xfee00000)
> > +#define INTERRUPT_RANGE_END (0xfeefffff)
> > +static void hv_iommu_get_resv_regions(struct device *dev,
> > + struct list_head *head)
> > +{
> > + struct iommu_resv_region *region;
> > +
> > + region = iommu_alloc_resv_region(INTERRUPT_RANGE_START,
> > + INTERRUPT_RANGE_END -
> > INTERRUPT_RANGE_START + 1,
> > + 0, IOMMU_RESV_MSI, GFP_KERNEL);
>
> Surprised these constants are not discovered from the hv?
>
Thanks. Since the pvIOMMU currently only targets x86, hardcoding
the x86 architectural MSI range should be reasonable?
If in the future we need additional reserved regions (e.g.,
RMRR-style identity-mapped ranges), those would be discovered
from the hypervisor.
> > +static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
> > +{
> > + hv_flush_device_domain(to_hv_iommu_domain(domain));
> > +}
> > +
> > +static void hv_iommu_iotlb_sync(struct iommu_domain *domain,
> > + struct iommu_iotlb_gather *iotlb_gather)
> > +{
> > + hv_flush_device_domain(to_hv_iommu_domain(domain));
> > +
> > + iommu_put_pages_list(&iotlb_gather->freelist);
> > +}
>
> Full invalidation only huh?
>
Page-selective IOTLB flush is added in patch 4/4.
> > +static const struct iommu_domain_ops hv_iommu_identity_domain_ops = {
> > + .attach_dev = hv_iommu_attach_dev,
> > +};
> > +
> > +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops = {
> > + .attach_dev = hv_iommu_attach_dev,
> > +};
>
> Usually I would expect these to have their own attach
> functions. blocking in particular must have an attach op that cannot
> fail. It is used to recover the device back to a known translation in
> case of cascading other errors.
>
For blocking domain, the hypercall handler of such attach essentially
disable the translation and IOPF for the device. It should not fail
unless there is a bug in the driver or the hypervisor.
How about something like this?
static int hv_blocking_attach_dev(domain, dev, old)
{
...
status = hv_do_hypercall(HVCALL_ATTACH, blocking, dev);
WARN_ON(!hv_result_success(status));
vdev->hv_domain = blocking;
return 0;
}
For identity domain, would it be fine to keep sharing the same attach
callback with paging domain?
> > +static const struct iommu_domain_ops hv_iommu_paging_domain_ops = {
> > + .attach_dev = hv_iommu_attach_dev,
> > + IOMMU_PT_DOMAIN_OPS(x86_64),
> > + .flush_iotlb_all = hv_iommu_flush_iotlb_all,
> > + .iotlb_sync = hv_iommu_iotlb_sync,
> > + .free = hv_iommu_paging_domain_free,
> > +};
> > +
> > +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device
> > *dev)
> > +{
> > + int ret;
> > + struct hv_iommu_domain *hv_domain;
> > + struct pt_iommu_x86_64_cfg cfg = {};
> > +
> > + hv_domain = kzalloc_obj(*hv_domain, GFP_KERNEL);
> > + if (!hv_domain)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> > + if (ret) {
> > + kfree(hv_domain);
> > + return ERR_PTR(ret);
> > + }
> > +
> > + hv_domain->domain.geometry = hv_iommu_device->geometry;
>
> geoemtry shouldn't be set here, it is overriden by
> pt_iommu_x86_64_init() with the exact page table configuration
>
Right. Will remove.
> > +static void __init hv_init_iommu_device(struct hv_iommu_dev *hv_iommu,
> > + struct hv_output_get_iommu_capabilities *hv_iommu_cap)
> > +{
> > + ida_init(&hv_iommu->domain_ids);
> > +
> > + hv_iommu->cap = hv_iommu_cap->iommu_cap;
> > + hv_iommu->max_iova_width = hv_iommu_cap->max_iova_width;
> > + if (!hv_iommu_5lvl_supported(hv_iommu->cap) &&
> > + hv_iommu->max_iova_width > 48) {
> > + pr_info("5-level paging not supported, limiting iova width to
> > 48.\n");
> > + hv_iommu->max_iova_width = 48;
> > + }
> > +
> > + hv_iommu->geometry = (struct iommu_domain_geometry) {
> > + .aperture_start = 0,
> > + .aperture_end = (((u64)1) << hv_iommu->max_iova_width) - 1,
> > + .force_aperture = true,
> > + };
> > +
> > + hv_iommu->first_domain = HV_DEVICE_DOMAIN_ID_DEFAULT + 1;
> > + hv_iommu->last_domain = HV_DEVICE_DOMAIN_ID_NULL - 1;
> > + /* Only x86 page sizes (4K/2M/1G) are supported */
> > + hv_iommu->pgsize_bitmap = hv_iommu_cap->pgsize_bitmap &
> > + (SZ_4K | SZ_2M | SZ_1G);
> > + if (hv_iommu->pgsize_bitmap != hv_iommu_cap->pgsize_bitmap)
> > + pr_warn("unsupported page sizes masked: 0x%llx -> 0x%llx\n",
> > + hv_iommu_cap->pgsize_bitmap, hv_iommu->pgsize_bitmap);
>
> IKD if you need this logic, the way the page table code is used it
> just sort of falls out naturally that other page sizes are ignored.
>
Right, the page table code can handle 4K/2M/1G naturally. We'd like
hypervisor to be able to choose to support only a subset of {4K, 2M,
1G}. This masking filters out any unexpected page sizes and falls
back to 4K if nothing valid remains.
The if statement may be a bit paranoid perhaps, but it felt safer
to validate explicitly. Happy to drop it if you think it's unnecessary.
> > +struct hv_iommu_domain {
> > + union {
> > + struct iommu_domain domain;
> > + struct pt_iommu pt_iommu;
> > + struct pt_iommu_x86_64 pt_iommu_x86_64;
> > + };
>
> You should retain the build assertions here
>
Sure. Thanks!
B.R.
Yu
> Jason
>