On 17/03/2025 6:48 pm, Andrea Bastoni wrote:
> Hi,
>
> In the last months we have been working with Stefano's team in AMD at tools to
> facilitate the analysis of "anonymous" memory allocations performed at 
> "runtime"
> from the shared Xen heap.
>
> - Anonymous here means: allocations that insist on the common Xen heap pool
>   (i.e., not on a per-domain heap pool).
> - Runtime here means: system_state >= SYS_STATE_active.
>
> TL;DR: A set of patches print a warning stack when anonymous allocations are
> detected at runtime. Scripts help to parse the stack traces that can be 
> checked
> to evaluate how problematic certain allocations paths can be or how difficult 
> it
> could be to update them. A full example of the results is e.g., in [1], and 
> the
> Readme-details in [2]. Below in the email more details and some commented 
> stack
> traces.
>
> Feedback, especially on the paths that are currently marked as "UNDECIDED"
> is very welcome.
>
> [1]: 
> https://gitlab.com/minervasys/public/xen/-/blob/minerva/warn/minerva_analysis/1674833972/parsed/x86_64/xilinx-hyperlaunch-x86_64-gcc-debug-virtio-pci-net
> [2]: 
> https://gitlab.com/minervasys/public/xen/-/blob/minerva/warn/minerva_analysis/README.md

Hello,

A few observations.

As you note, assigning ownership to current->domain is usually the wrong
thing.  Setting PGC_allocated however is worse; you've created a
security vulnerability where guests can free memory that Xen things is
private to it.

Similarly, you can't interchange xenheap and domheap like that.  It will
work on small systems, but not on large ones.  Nevertheless, it's an ok
start for debugging.

Ages ago, I tried to make a start on this problem,
https://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/xen-fault-ttl
simply by counting the number of xmalloc()'s to the domain with with
they were logically associated.  It's stalled because the
fault-injection test found a refcounting bug and there isn't even an
agreement on whether it's a bug or not, or how to fix it.


I'm struggling to follow the analysis on
https://gitlab.com/xen-project/people/sstabellini/xen/-/jobs/9207014172 
However, it appears to be a mix of reports between arm64 and x86_64
jobs, and they need separating to get a clearer picture.

The bottom of each stack trace appears to be doubled.  Is that intentional?

Part of the problem is that xmalloc() isn't separated from
{xen,dom}heap, despite the former being layered on top of the latter. 
All *heap pages are going to a multiple of 4k as that's the allocation
unit.  It will probably be better to terminate at _xmalloc() and ignore
the call down into alloc_domheap_page().  Importantly, _malloc
allocating another 4k page isn't strictly related to the caller, when
it's doing so to replenish it's free pool.


The 1-byte allocations are silly.  bitmap_to_xenctl_bitmap() is buggy
for trying to cater to an endianness case we simply don't support. 
There's no memory allocation needed at all.  There are patches to
resolve this from aaages back, blocked on a request to fix up other
endian helpers in the meantime.

evtchn_fifo_init_control() seems to show up a lot, and seems to be the
only dynamic allocation triggered by the guest itself.  It's reasonably
large, but probably ought to be allocated up front.  event_fifo is
pretty ubiquitous these days.  I'm not sure where the alternative sizes
come from; they ought to all be 392 bytes.

I'm concerned by the xfree() in efi_runtime_call() seeing as there's no
associated allocation that I can see.


Finally, it's not just anonymous allocations you need to worry about. 
Even ones out of the p2m memory pool are still fatal at the point of
exhaustion, and can cause safety issues.

~Andrew

Reply via email to