Re: [PATCH v7 1/2] memory: Update inline documentation

Peter Xu Tue, 14 Jan 2025 11:13:40 -0800

On Tue, Jan 14, 2025 at 05:42:57PM +0000, Peter Maydell wrote:
> On Tue, 14 Jan 2025 at 17:02, Peter Xu <pet...@redhat.com> wrote:
> >
> > On Tue, Jan 14, 2025 at 05:43:09PM +0900, Akihiko Odaki wrote:
> > > memory_region_finalize() is not a function to tell the owner is leaving, 
> > > but
> > > the memory region itself is being destroyed.
> >
> > It is when the lifecycle of the MR is the same as the owner.  That holds
> > true I suppose if without this patch, and that's why I don't prefer this
> > patch because it makes that part more complicated.
> >
> > > It should not happen when a container is still referencing it. That is
> > > also why it has memory_region_ref(subregion) in
> > > memory_region_update_container_subregions() and assert(!mr->container) in
> > > memory_region_finalize().
> >
> > Again, the line I added was sololy for what you said "automation" elsewhere
> > and only should work within MR-links within the same owner.  Otherwise
> > anyone referencing the MR would hold the owner ref then this finalize()
> > will never happen.
> >
> > Now, if I could go back to your original purpose of this work, quotting
> > from your cover letter:
> >
> > > I saw various sanitizer errors when running check-qtest-ppc64. While
> > > I could just turn off sanitizers, I decided to tackle them this time.
> > >
> > > Unfortunately, GLib versions older than 2.81.0 do not free test data in
> > > some cases so some sanitizer errors remain. All sanitizer errors will be
> > > gone with this patch series combined with the following change for GLib:
> > > https://gitlab.gnome.org/GNOME/glib/-/merge_requests/4120
> >
> > Is check-qtest-ppc64 the only one that will trigger this issue?  Does it
> > mean that most of the devices will do proper removal of device-owned
> > subregions (hence, not prone to circular reference of owner refcount)
> > except some devices in ppc64?
> 
> There's at least one test in the arm qtests that will hit this.
> I suspect that you'll find that most architectures except x86
> (where we don't have models of complex SoCs and the few
> machines we do have tend to be old code that is less QOMified)
> will hit similar issues. I think there's a general issue here,
> this isn't just "some particular ppc device is wrongly coded".


I see.  Do you know how many of them would be important memory leaks that
we should fix immediately?

I mean, we have known memory leaks in QEMU in many places I assume.  I am
curious how important this problem is, and whether such would justify a
memory API change that is not reaching a quorum state (and, imho, add
complexity to memory core and of course that spreads to x86 too even if it
was not affected) to be merged.  Or perhaps we can fix the important ones
first from the device model directly instead.

It's not new to me that QEMU can leave some memory allocated for the whole
lifecycle of the process.  E.g. I just worked on something for migration
that we could have UAF on migration object.  I tried to provide a core fix
via QOM singleton but unfortunately that didn't yet got accepted, so
migration still face such UAF.  It was not accepted because of some reasons
and reviewer concerns, so I suppose that's fair that until we reach a
consensus on an acceptable and clean general solution, we leave that issue
be there if it's a corner case anyway - in migration that was the case.

For this specific case, my current understanding is the important leaks are
where it can e.g. get devices frequently plugged and unplugged with can
cause QEMU to bloat till host OOM.  For those cases I wonder whether (even
if we want to provide a global sulution... but while before it settles..)
we could fix them first by correctly detach the subregions just like what
x86 does.

Thanks,

-- 
Peter Xu

Re: [PATCH v7 1/2] memory: Update inline documentation

Reply via email to