From: Yu Zhang <[email protected]> Sent: Friday, May 15, 2026 9:24 AM
> 
> On Thu, May 14, 2026 at 06:14:22PM +0000, Michael Kelley wrote:
> > From: Yu Zhang <[email protected]> Sent: Monday, May 11, 2026 
> > 9:24 AM
> > >

[....]
 
> > > + unsigned long nr_pages = end_pfn - start_pfn;
> > > + u16 count = 0;
> > > +
> > > + while (nr_pages > 0) {
> > > +         unsigned long flush_pages;
> > > +         int order;
> > > +         unsigned long pfn_align;
> > > +         unsigned long size_align;
> > > +
> > > +         if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > +                 count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > +                 break;
> > > +         }
> > > +
> > > +         if (start_pfn)
> > > +                 pfn_align = __ffs(start_pfn);
> >
> > I don't understand why __ffs() is correct here. I would expect
> > __fls() so it is consistent with the calculation of size_align. But I
> > can only surmise how the hypercall works since there's no
> > documentation, so maybe my understanding of the hypercall is
> > wrong.   If __ffs really is correct, a comment explaining why
> > would help. :-)
> >
> 
> The use of __ffs() is intentional. Each flush entry invalidates a
> naturally aligned 2^N page block, and the hypervisor requires the
> page_number to be aligned to 2^page_mask_shift.
> 
> Here __ffs() and __fls() serve different purposes:
> - __ffs(start_pfn) is about the alignment constraint, i.e., how
> large a block can this address support?
> - __fls(nr_pages) is about the size constraint, i.e., how large
> a block can the remaining range hold?
> 
> Taking min() of both ensures each entry is both properly aligned
> and within bounds.
> 
> Thanks for raising this - it definitely deserves a comment. I had to
> stare at it for a while myself to remember why. :)

Hmmm. Something about this still nags at me. I'll run some
experiments to either convince myself that you are right, or to
come up with a counterexample.
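
For my experiments, I'm thinking of a small user-space harness along
these lines -- just the decomposition loop lifted out of the patch, with
the kernel helpers replaced by compiler builtins, checking that every
entry it generates is naturally aligned to its order and counting how
many entries a range needs. The names and structure below are mine, not
from the patch, so treat it as a sketch only (it also assumes a 64-bit
unsigned long):

#include <assert.h>
#include <stdio.h>

/* mirror of the decomposition loop in hv_iommu_fill_iova_list() */
static int fill(unsigned long start_pfn, unsigned long nr_pages)
{
	int count = 0;

	while (nr_pages > 0) {
		unsigned long pfn_align, size_align, order;

		/* user-space stand-ins for __ffs() and __fls() */
		pfn_align = start_pfn ? __builtin_ctzl(start_pfn) : 63;
		size_align = 63 - __builtin_clzl(nr_pages);
		order = pfn_align < size_align ? pfn_align : size_align;

		/* each block must be naturally aligned to 2^order */
		assert((start_pfn & ((1UL << order) - 1)) == 0);

		start_pfn += 1UL << order;
		nr_pages -= 1UL << order;
		count++;
	}
	return count;
}

int main(void)
{
	unsigned long s, n;
	int c, max = 0;

	/* brute force all small ranges; report the worst-case entry count */
	for (s = 0; s < 2048; s++) {
		for (n = 1; n <= 2048; n++) {
			c = fill(s, n);
			if (c > max)
				max = c;
		}
	}
	printf("worst case for ranges up to 2048 pages: %d entries\n", max);
	return 0;
}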

A related thought occurred to me. If each flush entry that is passed
to Hyper-V describes a naturally aligned 2^N page block, I don't
think HV_IOMMU_MAX_FLUSH_VA_COUNT can ever be reached. The number of
entries is limited by the number of bits in a PFN and in the pages
count, both of which are at most 64. And with 52-bit physical
addressing and 4 KiB pages, the actual width of a PFN (and of the
pages count) is well under 64 bits. HV_IOMMU_MAX_FLUSH_VA_COUNT is
the number of 8-byte union hv_iommu_flush_va entries that fit in a
4 KiB page, so it's ~500.
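
To spell out the reasoning behind that bound (this is my own
back-of-the-envelope math, not anything stated in the patch): each
loop iteration either makes start_pfn more aligned (when pfn_align is
the limiter, adding 1 << order carries out of the lowest set bit) or
clears the top set bit of nr_pages (when size_align is the limiter).
So a single range generates at most roughly

	bits(PFN) + bits(pages count) <= 40 + 40 = 80

entries with 52-bit physical addressing and 4 KiB pages, far below
the ~500 limit.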

My statement applies to a single flush range. If multiple flush
ranges were strung together in a single hypercall, a larger count
could be reached, but hv_flush_device_domain_list() only does a
single range. So I don't think the overflow case in
hv_flush_device_domain_list() can ever happen. But let me do my
experiments, and I'll also look at this aspect to confirm whether
it's right.

> 
> > > +         else
> > > +                 pfn_align = BITS_PER_LONG - 1;
> > > +
> > > +         size_align = __fls(nr_pages);
> > > +         order = min(pfn_align, size_align);
> > > +         iova_list[count].page_mask_shift = order;
> > > +         iova_list[count].page_number = start_pfn;
> > > +
> > > +         flush_pages = 1UL << order;
> > > +         start_pfn += flush_pages;
> > > +         nr_pages -= flush_pages;
> > > +         count++;
> > > + }
> > > +
> > > + return count;
> > > +}
> > > +
> > > +static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
> > > +                                 struct iommu_iotlb_gather *iotlb_gather)
> > > +{
> > > + u64 status;
> > > + u16 count;
> > > + unsigned long flags;
> > > + struct hv_input_flush_device_domain_list *input;
> > > +
> > > + local_irq_save(flags);
> > > +
> > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > + memset(input, 0, sizeof(*input));
> > > +
> > > + input->device_domain = hv_domain->device_domain;
> > > + input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;
> >
> > I would suggest moving the memset() and setting the input fields down
> > under the "else" below so that they are parallel with the flush all case.
> >
> 
> I agree the structure should be more symmetric. But the memset and
> hv_iommu_fill_iova_list() need to stay before the branch, since the fill
> writes directly into input->iova_list[]. :)

Agreed.

> 
> > > + count = hv_iommu_fill_iova_list(input->iova_list,
> > > +                                 iotlb_gather->start,
> > > +                                 iotlb_gather->end);
> > > + if (count == HV_IOMMU_FLUSH_VA_OVERFLOW) {
> > > +         /*
> > > +          * Range exceeds hypercall page capacity. Fall back to a full
> > > +          * domain flush.
> > > +          */
> > > +         struct hv_input_flush_device_domain *flush_all = (void *)input;
> > > +
> > > +         memset(flush_all, 0, sizeof(*flush_all));
> > > +         flush_all->device_domain = hv_domain->device_domain;
> > > +         status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN,
> > > +                                 flush_all, NULL);
> > > + } else {
> > > +         status = hv_do_rep_hypercall(
> > > +                         HVCALL_FLUSH_DEVICE_DOMAIN_LIST,
> > > +                         count, 0, input, NULL);
> > > + }
> > > +
> > > + local_irq_restore(flags);
> > > +
> > > + if (!hv_result_success(status))
> > > +         pr_err("HVCALL_FLUSH_DEVICE_DOMAIN_LIST failed, status %lld\n", status);
> >
> > As Sashiko pointed out, a failure here can lead to all kinds of trouble
> > because it leaves unflushed entries behind. Maybe a WARN() is more
> > appropriate? Also, maybe a failure in the list flush should try a flush
> > all as a fallback, with the WARN() only if the flush all fails.
> >
> 
> Good idea. How about we restructure this routine to something like this:
> 
> 
>       memset(input, 0, sizeof(*input));
>       count = hv_iommu_fill_iova_list(...);
> 
>       if (count != HV_IOMMU_FLUSH_VA_OVERFLOW) {
>               input->device_domain = ...;
>               ...
>               status = hv_do_rep_hypercall(FLUSH_DEVICE_DOMAIN_LIST, ...);
>               if (hv_result_success(status))
>                       goto out;
>       }
> 
>       /* overflow or list flush failed: fallback to full domain flush */
>       flush_all = (void *)input;
>       memset(flush_all, 0, sizeof(*flush_all));
>       flush_all->device_domain = ...;
>       status = hv_do_hypercall(FLUSH_DEVICE_DOMAIN, ...);
>       WARN(!hv_result_success(status), "IOTLB flush failed, status %lld\n",
>            status);
> 
> out:
>       local_irq_restore(flags);
> 

Yes, I think this works. But per my earlier comment, if I'm right that
the overflow case never occurs, it could be simplified further to just
do the list flush with the full flush as the error fallback. Then WARN
if the full flush fails.
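
Something along these lines -- still just a sketch, with the same field
elisions as your version, and it assumes the overflow case really is
impossible:

	memset(input, 0, sizeof(*input));
	input->device_domain = ...;
	input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;
	count = hv_iommu_fill_iova_list(...);

	status = hv_do_rep_hypercall(FLUSH_DEVICE_DOMAIN_LIST,
				count, 0, input, NULL);
	if (hv_result_success(status))
		goto out;

	/* list flush failed: fall back to a full domain flush */
	flush_all = (void *)input;
	memset(flush_all, 0, sizeof(*flush_all));
	flush_all->device_domain = ...;
	status = hv_do_hypercall(FLUSH_DEVICE_DOMAIN, flush_all, NULL);
	WARN(!hv_result_success(status),
	     "IOTLB flush failed, status %lld\n", status);

out:
	local_irq_restore(flags);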

Michael
