On Tue, Jan 14, 2025 at 05:42:57PM +0000, Peter Maydell wrote: > On Tue, 14 Jan 2025 at 17:02, Peter Xu <pet...@redhat.com> wrote: > > > > On Tue, Jan 14, 2025 at 05:43:09PM +0900, Akihiko Odaki wrote: > > > memory_region_finalize() is not a function to tell the owner is leaving, > > > but > > > the memory region itself is being destroyed. > > > > It is when the lifecycle of the MR is the same as the owner. That holds > > true I suppose if without this patch, and that's why I don't prefer this > > patch because it makes that part more complicated. > > > > > It should not happen when a container is still referencing it. That is > > > also why it has memory_region_ref(subregion) in > > > memory_region_update_container_subregions() and assert(!mr->container) in > > > memory_region_finalize(). > > > > Again, the line I added was sololy for what you said "automation" elsewhere > > and only should work within MR-links within the same owner. Otherwise > > anyone referencing the MR would hold the owner ref then this finalize() > > will never happen. > > > > Now, if I could go back to your original purpose of this work, quotting > > from your cover letter: > > > > > I saw various sanitizer errors when running check-qtest-ppc64. While > > > I could just turn off sanitizers, I decided to tackle them this time. > > > > > > Unfortunately, GLib versions older than 2.81.0 do not free test data in > > > some cases so some sanitizer errors remain. All sanitizer errors will be > > > gone with this patch series combined with the following change for GLib: > > > https://gitlab.gnome.org/GNOME/glib/-/merge_requests/4120 > > > > Is check-qtest-ppc64 the only one that will trigger this issue? Does it > > mean that most of the devices will do proper removal of device-owned > > subregions (hence, not prone to circular reference of owner refcount) > > except some devices in ppc64? > > There's at least one test in the arm qtests that will hit this. > I suspect that you'll find that most architectures except x86 > (where we don't have models of complex SoCs and the few > machines we do have tend to be old code that is less QOMified) > will hit similar issues. I think there's a general issue here, > this isn't just "some particular ppc device is wrongly coded".
I see. Do you know how many of them would be important memory leaks that we should fix immediately? I mean, we have known memory leaks in QEMU in many places I assume. I am curious how important this problem is, and whether such would justify a memory API change that is not reaching a quorum state (and, imho, add complexity to memory core and of course that spreads to x86 too even if it was not affected) to be merged. Or perhaps we can fix the important ones first from the device model directly instead. It's not new to me that QEMU can leave some memory allocated for the whole lifecycle of the process. E.g. I just worked on something for migration that we could have UAF on migration object. I tried to provide a core fix via QOM singleton but unfortunately that didn't yet got accepted, so migration still face such UAF. It was not accepted because of some reasons and reviewer concerns, so I suppose that's fair that until we reach a consensus on an acceptable and clean general solution, we leave that issue be there if it's a corner case anyway - in migration that was the case. For this specific case, my current understanding is the important leaks are where it can e.g. get devices frequently plugged and unplugged with can cause QEMU to bloat till host OOM. For those cases I wonder whether (even if we want to provide a global sulution... but while before it settles..) we could fix them first by correctly detach the subregions just like what x86 does. Thanks, -- Peter Xu