On 01/02/2021 13:18, Jan Beulich wrote:
> On 30.01.2021 03:58, Andrew Cooper wrote:
>> +static int vmtrace_alloc_buffer(struct vcpu *v)
>> +{
>> +    struct domain *d = v->domain;
>> +    struct page_info *pg;
>> +    unsigned int i;
>> +
>> +    if ( !d->vmtrace_size )
>> +        return 0;
>> +
>> +    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
>> +                             MEMF_no_refcount);
>> +    if ( !pg )
>> +        return -ENOMEM;
>> +
>> +    /*
>> +     * Getting the reference counting correct here is hard.
>> +     *
>> +     * All pages are now on the domlist.  They, or subranges within, will be
> "domlist" is too imprecise, as there's no list with this name. It's
> extra_page_list in this case (see also below).
>
>> +     * freed when their reference count drops to zero, which may any time
>> +     * between now and the domain teardown path.
>> +     */
>> +
>> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
>> +        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
>> +            goto refcnt_err;
>> +
>> +    /*
>> +     * We must only let vmtrace_free_buffer() take any action in the success
>> +     * case when we've taken all the refs it intends to drop.
>> +     */
>> +    v->vmtrace.pg = pg;
>> +
>> +    return 0;
>> +
>> + refcnt_err:
>> +    /*
>> +     * In the failure case, we must drop all the acquired typerefs thus far,
>> +     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
>> +     * drop the alloc refs on any remaining pages - some pages could already
>> +     * have been freed behind our backs.
>> +     */
>> +    while ( i-- )
>> +        put_page_and_type(&pg[i]);
>> +
>> +    return -ENODATA;
>> +}
> As said in reply on the other thread, PGC_extra pages don't get
> freed automatically. I too initially thought they would, but
> (re-)learned otherwise when trying to repro your claims on that
> other thread. For all pages you've managed to get the writable
> ref, freeing is easily done by prefixing the loop body above by
> put_page_alloc_ref(). For all other pages best you can do (I
> think; see the debugging patches I had sent on that other
> thread) is to try get_page() - if it succeeds, calling
> put_page_alloc_ref() is allowed. Otherwise we can only leak the
> respective page (unless going to further extents with trying to
> recover from the "impossible"), or assume the failure here was
> because it did get freed already.

Right - I'm going to insist on breaking apart orthogonal issues.

This refcounting issue isn't introduced by this series - this series
uses an established pattern, in which we've found a corner case.

The corner case is theoretical, not practical - it is not possible for a
malicious PV domain to take 2^43 refs on any of the pages in this
allocation.  Doing so would require an hours-long SMI, or equivalent
(even at one ref per nanosecond, 2^43 refs takes roughly 2.4 hours), and
even then all malicious activity would be paused after 1s for the time
calibration rendezvous, which would livelock the system until the
watchdog kicked in.


I will drop the comments, because in light of this discovery, they're
not correct.

We should fix the corner case, but that should be a separate patch. 
Whatever we do needs to start by writing down the refcounting rules
first, because it's totally clear that no one understands them, and then
adjusting all examples of this pattern (if necessary).
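
For reference, something along the lines Jan suggests might look like the
following.  This is a completely untested sketch, just to capture the shape
of it (the loop variable and structure are illustrative, not a real patch):

 refcnt_err:
    /* Untested sketch, not a committed error path. */
    {
        unsigned int j;

        for ( j = 0; j < (d->vmtrace_size >> PAGE_SHIFT); j++ )
        {
            if ( j < i )
            {
                /* Typeref held: drop the alloc ref, then the typeref. */
                put_page_alloc_ref(&pg[j]);
                put_page_and_type(&pg[j]);
            }
            else if ( get_page(&pg[j], d) )
            {
                /* Page still live: drop the alloc ref, then our temporary ref. */
                put_page_alloc_ref(&pg[j]);
                put_page(&pg[j]);
            }
            /* else: the page was freed behind our backs, or has to be leaked. */
        }
    }

    return -ENODATA;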

~Andrew
