On Thu, Jan 08, 2026 at 09:24:51AM +0100, Jan Beulich wrote:
> On 07.01.2026 18:56, Roger Pau Monne wrote:
> > The current logic splits the update of the amount of available memory in
> > the system (total_avail_pages) and pending claims into two separately
> > locked regions.  This leads to a window between counters adjustments where
> > the result of total_avail_pages - outstanding_claims doesn't reflect the
> > real amount of free memory available, and can return a negative value due
> > to total_avail_pages having been updated ahead of outstanding_claims.
> > 
> > Fix by adjusting outstanding_claims and d->outstanding_pages in the same
> > place where total_avail_pages is updated.  Note that accesses to
> > d->outstanding_pages is protected by the global heap_lock, just like
> > total_avail_pages or outstanding_claims.  Add a comment to the field
> > declaration, and also adjust the comment at the top of
> > domain_set_outstanding_pages() to be clearer in that regard.
> > 
> > Note that failures in assign_pages() causes the claimed amount that has
> > been allocated to be lost, as the amount is not added back to the domain
> > quota once pages are freed.  Given the intended usage of claims is limited
> > to initial physmap populate, and the current failure paths in
> > assign_pages() would lead to the domain being destroyed anyway, don't
> > add extra logic to recover the claimed amount on failure - it's just adding
> > complexity for no real benefit.
> > 
> > Fixes: 65c9792df600 ("mmu: Introduce XENMEM_claim_pages (subop of memory 
> > ops)")
> > Signed-off-by: Roger Pau Monné <[email protected]>
> > ---
> > Changes since v2:
> >  - Revert back to the approach in v1.
> 
> You didn't fully go back to v1. While ...
> 
> > @@ -548,9 +524,10 @@ int domain_set_outstanding_pages(struct domain *d, 
> > unsigned long pages)
> >      unsigned long claim, avail_pages;
> >  
> >      /*
> > -     * take the domain's page_alloc_lock, else all d->tot_page adjustments
> > -     * must always take the global heap_lock rather than only in the much
> > -     * rarer case that d->outstanding_pages is non-zero
> > +     * Two locks are needed here:
> > +     *  - d->page_alloc_lock: protects accesses to 
> > d->{tot,max,extra}_pages.
> > +     *  - heap_lock: protects accesses to d->outstanding_pages, 
> > total_avail_pages
> > +     *    and outstanding_claims.
> >       */
> >      nrspin_lock(&d->page_alloc_lock);
> >      spin_lock(&heap_lock);
> 
> ... the comment improvement is of course okay to keep, ...
> 
> > @@ -999,7 +976,7 @@ static struct page_info *alloc_heap_pages(
> >  {
> >      nodeid_t node;
> >      unsigned int i, buddy_order, zone, first_dirty;
> > -    unsigned long request = 1UL << order;
> > +    unsigned int request = 1UL << order;
> 
> ... this I'm less certain about (and if it was to be kept, it should also
> become 1U). For one, this bounds check:
> 
>     if ( (outstanding_claims + request > total_avail_pages) &&
> 
> ends up still being okay (perhaps except on Arm32, but the wrapping issue
> there is pre-existing, albeit possibly benign due to other constraints),
> but just because outstanding_claims is "long" (and it's unclear to me why
> it isn't "unsigned long").
> 
> And then, what exactly is it that you want the more narrow type for (the
> description says nothing in that regard)? The other relevant uses of the
> variable are
> 
>     avail[node][zone] -= request;
>     total_avail_pages -= request;
> 
> where both avail[][] and total_avail_pages are (unsigned) long (again
> unclear to me why for total_avail_pages it's plain long).
> 
> Oh, wait, it is ...
> 
> > @@ -1071,6 +1050,30 @@ static struct page_info *alloc_heap_pages(
> >      total_avail_pages -= request;
> >      ASSERT(total_avail_pages >= 0);
> >  
> > +    if ( d && d->outstanding_pages && !(memflags & MEMF_no_refcount) )
> > +    {
> > +        /*
> > +         * Adjust claims in the same locked region where total_avail_pages 
> > is
> > +         * adjusted, not doing so would lead to a window where the amount 
> > of
> > +         * free memory (avail - claimed) would be incorrect.
> > +         *
> > +         * Note that by adjusting the claimed amount here it's possible for
> > +         * pages to fail to be assigned to the claiming domain while 
> > already
> > +         * having been subtracted from d->outstanding_pages.  Such claimed
> > +         * amount is then lost, as the pages that fail to be assigned to 
> > the
> > +         * domain are freed without replenishing the claim.  This is fine 
> > given
> > +         * claims are only to be used during physmap population as part of
> > +         * domain build, and any failure in assign_pages() there will 
> > result in
> > +         * the domain being destroyed before creation is finished.  Losing 
> > part
> > +         * of the claim makes no difference.
> > +         */
> > +        unsigned int outstanding = min(d->outstanding_pages, request);
> 
> ... the desire to avoid use of min_t() here which wants "request" to be
> "unsigned int". At some point we'll want to change the struct domain fields
> to unsigned long anyway, at which point the above would need adjustment. It's
> well possible that such an adjustment would end up being to then use min_t().
> Imo we'd be better off using e.g.
> 
>         unsigned int outstanding = min(d->outstanding_pages + 0UL, request);
> 
> or even
> 
>         typeof(d->outstanding_pages) outstanding =
>             min(d->outstanding_pages + 0UL, request);
> 
> right away. In the latter case the decl wouldn't even need touching when the
> struct domain fields are promoted.

My preference would be:

unsigned long outstanding = min(d->outstanding_pages + 0UL, request);

If that's fine with you.

> > +        BUG_ON(outstanding > d->outstanding_pages);
> 
> Unlike in v1, where the min() was different, this is now dead code.

Oh, I need to adjust this so it's outstanding > outstanding_claims
instead.

Thanks, Roger.

Reply via email to