On Wed Feb 26, 2025 at 4:51 PM GMT, Andrew Cooper wrote: > On 26/02/2025 4:42 pm, Jan Beulich wrote: > > On 26.02.2025 17:34, Andrew Cooper wrote: > >> On 26/02/2025 4:06 pm, Jan Beulich wrote: > >>> On 26.02.2025 17:04, Roger Pau Monné wrote: > >>>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote: > >>>>> On 26.02.2025 15:28, Roger Pau Monné wrote: > >>>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: > >>>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote: > >>>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: > >>>>>>>>> --- a/xen/common/page_alloc.c > >>>>>>>>> +++ b/xen/common/page_alloc.c > >>>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total > >>>>>>>>> outstanding claims by all domains */ > >>>>>>>>> > >>>>>>>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > >>>>>>>>> { > >>>>>>>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; > >>>>>>>>> - > >>>>>>>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); > >>>>>>>>> d->tot_pages += pages; > >>>>>>>>> > >>>>>>>>> /* > >>>>>>>>> - * can test d->claimed_pages race-free because it can only > >>>>>>>>> change > >>>>>>>>> + * can test d->outstanding_pages race-free because it can only > >>>>>>>>> change > >>>>>>>>> * if d->page_alloc_lock and heap_lock are both held, see also > >>>>>>>>> * domain_set_outstanding_pages below > >>>>>>>>> */ > >>>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct > >>>>>>>>> domain *d, long pages) > >>>>>>>>> goto out; > >>>>>>>> I think you can probably short-circuit the logic below if pages == 0? > >>>>>>>> (and avoid taking the heap_lock) > >>>>>>> Are there callers passing in 0? > >>>>>> Not sure, but if there are no callers expected we might add an ASSERT > >>>>>> to that effect then. > >>>>>> > >>>>>>>>> spin_lock(&heap_lock); > >>>>>>>>> - /* adjust domain outstanding pages; may not go negative */ > >>>>>>>>> - dom_before = d->outstanding_pages; > >>>>>>>>> - dom_after = dom_before - pages; > >>>>>>>>> - BUG_ON(dom_before < 0); > >>>>>>>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; > >>>>>>>>> - d->outstanding_pages = dom_claimed; > >>>>>>>>> - /* flag accounting bug if system outstanding_claims would go > >>>>>>>>> negative */ > >>>>>>>>> - sys_before = outstanding_claims; > >>>>>>>>> - sys_after = sys_before - (dom_before - dom_claimed); > >>>>>>>>> - BUG_ON(sys_after < 0); > >>>>>>>>> - outstanding_claims = sys_after; > >>>>>>>>> + BUG_ON(outstanding_claims < d->outstanding_pages); > >>>>>>>>> + if ( pages > 0 && d->outstanding_pages < pages ) > >>>>>>>>> + { > >>>>>>>>> + /* `pages` exceeds the domain's outstanding count. Zero it > >>>>>>>>> out. */ > >>>>>>>>> + outstanding_claims -= d->outstanding_pages; > >>>>>>>>> + d->outstanding_pages = 0; > >>>>>>>>> + } else { > >>>>>>>>> + outstanding_claims -= pages; > >>>>>>>>> + d->outstanding_pages -= pages; > >>>>>>>> I wonder if it's intentional for a pages < 0 value to modify > >>>>>>>> outstanding_claims and d->outstanding_pages, I think those values > >>>>>>>> should only be set from domain_set_outstanding_pages(). > >>>>>>>> domain_adjust_tot_pages() should only decrease the value, but never > >>>>>>>> increase either outstanding_claims or d->outstanding_pages. > >>>>>>>> > >>>>>>>> At best the behavior is inconsistent, because once > >>>>>>>> d->outstanding_pages reaches 0 there will be no further modification > >>>>>>>> from domain_adjust_tot_pages(). > >>>>>>> Right, at that point the claim has run out. While freeing pages with > >>>>>>> an > >>>>>>> active claim means that the claim gets bigger (which naturally needs > >>>>>>> reflecting in the global). > >>>>>> domain_adjust_tot_pages() is not exclusively called when freeing > >>>>>> pages, see steal_page() for example. > >>>>> Or also when pages were allocated. steal_page() ... > >>>>> > >>>>>> When called from steal_page() it's wrong to increase the claim, as > >>>>>> it assumes that the page removed from d->tot_pages is freed, but > >>>>>> that's not the case. The domain might end up in a situation where > >>>>>> the claim is bigger than the available amount of memory. > >>>>> ... is a case that likely wasn't considered when the feature was added. > >>>>> > >>>>> I never really liked this; I'd be quite happy to see it ripped out, as > >>>>> long as we'd be reasonably certain it isn't in active use by people. > >>>> What do you mean with 'it' in the above sentence, the whole claim > >>>> stuff? > >>> Yes. > >>> > >>>> Or just getting rid of allowing the claim to increase as a > >>>> result of pages being removed from a domain? > >>> No. > >> Alejandro and I discussed this earlier in the week. > >> > >> The claim infrastructure stuff is critical for a toolstack capable of > >> doing things in parallel. > >> > >> However, it is also nonsensical for there to be a remaining claim by the > >> time domain construction is done. > > I'm not entirely sure about this. Iirc it was the tmem work where this was > > added, and then pretty certainly it was needed also for already running > > domains. > > It wasn't TMEM. It was generally large-memory VMs. > > The problem is if you've got 2x 2T VMs booting on a 3T system. > > Previously, you'd start building both of them, and minutes later they > both fail because Xen was fully out of memory. > > Claim was introduced to atomically reserve the memory you were intending > to build a domain with. > > For XenServer, we're working on NUMA fixes, and something that's > important there is to be able to reserve memory on a specific NUMA node > (hence why this is all getting looked at). > >> If creation_finished were a concrete thing, rather than a bodge hacked > >> into domain_unpause_by_systemcontroller(), it ought to be made to fail > >> if there were an outstanding claim. I suggested that we follow through > >> on a previous suggestion of making it a real hypercall (which is needed > >> by the encrypted VM crowd too). > > Rather than failing we could simply zap the leftover? > > Hmm. Perhaps. > > I'd be slightly wary about zapping a claim, but there should only be an > outstanding claim if the toolstack did something wrong. > > OTOH, we absolutely definitely do need a real hypercall here at some > point soon. > > ~Andrew
It should probably be removed at the end. Otherwise we're effectively leaking all claimed but non-used memory for the lifetime of the domain. Cheers, Alejandro