On Thu, May 14, 2026 at 06:14:22PM +0000, Michael Kelley wrote:
> From: Yu Zhang <[email protected]> Sent: Monday, May 11, 2026 9:24 AM
> >
> > Add page-selective IOTLB flush using HVCALL_FLUSH_DEVICE_DOMAIN_LIST.
> > This hypercall accepts a list of (page_number, page_mask_shift) entries,
> > enabling finer-grained IOTLB invalidation compared to the domain-wide
> > HVCALL_FLUSH_DEVICE_DOMAIN used by hv_iommu_flush_iotlb_all().
> >
> > hv_iommu_fill_iova_list() decomposes a contiguous IOVA range into a
> > minimal set of aligned power-of-two regions that fit in a single
> > hypercall input page. When the range exceeds the page capacity, the
> > code falls back to a full domain flush automatically.
> >
> > Signed-off-by: Yu Zhang <[email protected]>
> > Signed-off-by: Easwar Hariharan <[email protected]>
> > ---
> > drivers/iommu/hyperv/iommu.c | 91 +++++++++++++++++++++++++++++++++++-
> > include/hyperv/hvgdk_mini.h | 1 +
> > include/hyperv/hvhdk_mini.h | 17 +++++++
> > 3 files changed, 108 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c
> > index e5fc625314b5..3bca362b7815 100644
> > --- a/drivers/iommu/hyperv/iommu.c
> > +++ b/drivers/iommu/hyperv/iommu.c
> > @@ -486,10 +486,98 @@ static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
> > hv_flush_device_domain(to_hv_iommu_domain(domain));
> > }
> >
> > +/* Max number of iova_list entries in a single hypercall input page. */
> > +#define HV_IOMMU_MAX_FLUSH_VA_COUNT \
> > +	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_flush_device_domain_list)) / \
> > +	 sizeof(union hv_iommu_flush_va))
> > +
> > +/* Returned by hv_iommu_fill_iova_list() when the range exceeds the capacity */
> > +#define HV_IOMMU_FLUSH_VA_OVERFLOW U16_MAX
> > +
> > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > + unsigned long start,
> > + unsigned long end)
> > +{
> > + unsigned long start_pfn = start >> PAGE_SHIFT;
> > + unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
>
> "end" is an inclusive end address per comment in struct iommu_iotlb_gather.
> So a page aligned value would typically have 0xFFF as the low order 12 bits,
> and PAGE_ALIGN() will do the right thing. But I don't think the value is
> *required* to be page aligned. If the value of "end" had 0x000 as the
> low order 12 bits, the above calculation would fail to include the page
> that has the address ending in 0x000. I think it needs to be
> PAGE_ALIGN(end + 1) in order to work correctly for this corner case.
>
Good catch! Will use HVPFN_DOWN(start) and HVPFN_UP(end + 1) as you
suggested in your follow-up mail.
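Roughly (untested, just to spell out the inclusive-end handling; HVPFN_DOWN()
and HVPFN_UP() are the existing Hyper-V page helpers):

	/* iommu_iotlb_gather's "end" is inclusive, hence the +1 */
	unsigned long start_pfn = HVPFN_DOWN(start);
	unsigned long end_pfn = HVPFN_UP(end + 1);
	unsigned long nr_pages = end_pfn - start_pfn;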
> > + unsigned long nr_pages = end_pfn - start_pfn;
> > + u16 count = 0;
> > +
> > + while (nr_pages > 0) {
> > + unsigned long flush_pages;
> > + int order;
> > + unsigned long pfn_align;
> > + unsigned long size_align;
> > +
> > + if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > + count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > + break;
> > + }
> > +
> > + if (start_pfn)
> > + pfn_align = __ffs(start_pfn);
>
> I don't understand why __ffs() is correct here. I would expect
> __fls() so it is consistent with the calculation of size_align. But I
> can only surmise how the hypercall works since there's no
> documentation, so maybe my understanding of the hypercall is
> wrong. If __ffs really is correct, a comment explaining why
> would help. :-)
>
The use of __ffs() is intentional. Each flush entry invalidates a
naturally aligned 2^N page block, and the hypervisor requires the
page_number to be aligned to 2^page_mask_shift.
Here __ffs() and __fls() serve different purposes:
- __ffs(start_pfn) captures the alignment constraint, i.e., how large
  a naturally aligned block can start at this address?
- __fls(nr_pages) captures the size constraint, i.e., how large a
  block can the remaining range hold?
Taking min() of both ensures each entry is both properly aligned
and within bounds.
Thanks for raising this; it definitely deserves a comment. I had to
stare at it for a while myself to remember why. :)
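As a made-up example of how the two constraints interact, flushing a
range with start_pfn = 0x3 and nr_pages = 9 decomposes into:

	entry 0: page_number 0x3, page_mask_shift 0  /* __ffs(0x3) = 0 -> 1 page */
	entry 1: page_number 0x4, page_mask_shift 2  /* min(__ffs(0x4), __fls(8)) = 2 -> 4 pages */
	entry 2: page_number 0x8, page_mask_shift 2  /* min(__ffs(0x8), __fls(4)) = 2 -> 4 pages */

i.e. exactly the 9 pages, with every block naturally aligned to its size.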
> > + else
> > + pfn_align = BITS_PER_LONG - 1;
> > +
> > + size_align = __fls(nr_pages);
> > + order = min(pfn_align, size_align);
> > + iova_list[count].page_mask_shift = order;
> > + iova_list[count].page_number = start_pfn;
> > +
> > + flush_pages = 1UL << order;
> > + start_pfn += flush_pages;
> > + nr_pages -= flush_pages;
> > + count++;
> > + }
> > +
> > + return count;
> > +}
> > +
> > +static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
> > + struct iommu_iotlb_gather *iotlb_gather)
> > +{
> > + u64 status;
> > + u16 count;
> > + unsigned long flags;
> > + struct hv_input_flush_device_domain_list *input;
> > +
> > + local_irq_save(flags);
> > +
> > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > + memset(input, 0, sizeof(*input));
> > +
> > + input->device_domain = hv_domain->device_domain;
> > + input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;
>
> I would suggest moving the memset() and setting the input fields down
> under the "else" below so that they are parallel with the flush all case.
>
I agree the structure should be more symmetric. However, the memset() and
the hv_iommu_fill_iova_list() call need to stay before the branch, since
the fill writes directly into input->iova_list[]. :)
> > + count = hv_iommu_fill_iova_list(input->iova_list,
> > + iotlb_gather->start,
> > + iotlb_gather->end);
> > + if (count == HV_IOMMU_FLUSH_VA_OVERFLOW) {
> > + /*
> > + * Range exceeds hypercall page capacity. Fall back to a full
> > + * domain flush.
> > + */
> > + struct hv_input_flush_device_domain *flush_all = (void *)input;
> > +
> > + memset(flush_all, 0, sizeof(*flush_all));
> > + flush_all->device_domain = hv_domain->device_domain;
> > + status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN,
> > + flush_all, NULL);
> > + } else {
> > + status = hv_do_rep_hypercall(
> > + HVCALL_FLUSH_DEVICE_DOMAIN_LIST,
> > + count, 0, input, NULL);
> > + }
> > +
> > + local_irq_restore(flags);
> > +
> > + if (!hv_result_success(status))
> > +		pr_err("HVCALL_FLUSH_DEVICE_DOMAIN_LIST failed, status %lld\n",
> > +		       status);
>
> As Sashiko pointed out, a failure here can lead to all kinds of trouble
> because of leaving unflushed entries. Maybe a WARN() is more appropriate?
> Also, maybe a failure in the list flush should try a flush all as a
> fallback, with the WARN() only if the flush all fails.
>
Good idea. How about we restructure this routine to something like this:

	memset(input, 0, sizeof(*input));
	count = hv_iommu_fill_iova_list(...);
	if (count != HV_IOMMU_FLUSH_VA_OVERFLOW) {
		input->device_domain = ...;
		...
		status = hv_do_rep_hypercall(FLUSH_DEVICE_DOMAIN_LIST, ...);
		if (hv_result_success(status))
			goto out;
	}

	/* overflow or list flush failed: fall back to a full domain flush */
	flush_all = (void *)input;
	memset(flush_all, 0, sizeof(*flush_all));
	flush_all->device_domain = ...;
	status = hv_do_hypercall(FLUSH_DEVICE_DOMAIN, ...);
	WARN(!hv_result_success(status), "IOTLB flush failed, status %lld\n",
	     status);
out:
	local_irq_restore(flags);
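
I.e., in full (untested, just reusing the names from this patch), something
like:

static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
					struct iommu_iotlb_gather *iotlb_gather)
{
	struct hv_input_flush_device_domain_list *input;
	struct hv_input_flush_device_domain *flush_all;
	unsigned long flags;
	u64 status;
	u16 count;

	local_irq_save(flags);

	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
	memset(input, 0, sizeof(*input));

	/* Fill iova_list[] first; it lives in the same hypercall input page. */
	count = hv_iommu_fill_iova_list(input->iova_list,
					iotlb_gather->start,
					iotlb_gather->end);
	if (count != HV_IOMMU_FLUSH_VA_OVERFLOW) {
		input->device_domain = hv_domain->device_domain;
		input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;
		status = hv_do_rep_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN_LIST,
					     count, 0, input, NULL);
		if (hv_result_success(status))
			goto out;
	}

	/* Overflow or list flush failure: fall back to a full domain flush. */
	flush_all = (void *)input;
	memset(flush_all, 0, sizeof(*flush_all));
	flush_all->device_domain = hv_domain->device_domain;
	status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN, flush_all, NULL);
	WARN(!hv_result_success(status),
	     "IOTLB flush failed, status %lld\n", status);
out:
	local_irq_restore(flags);
}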
B.R.
Yu