On Fri, 6 Sep 2024 10:35:39 GMT, Thomas Stuefe <stu...@openjdk.org> wrote:
>> We have a healthy amount of tests already. We need a wide use of >> assert_locked and need to take a close look at error reporting. Locking, in >> NMT, is rather simple. >> >> Error handling: My pragmatic "good enough for this rare scenario" proposal >> would be: >> >> At the entry of error reporting, if the NMT mutex is locked by the crashing >> thread, just unlock it manually and try to continue. The chance is high that >> this will actually work well enough to survive the few mallocs we may do >> during error reporting. >> >> Reasoning: We use NMT mutex as we did ThreadCritical: rather widely scoped. >> The time windows where things are out of sync and reentrant use of the same >> functionality can bite us are small. Moreover, NMT's two subsystems - malloc >> tracker and mmap tracker - are mostly independent from each other, and even >> leaving the mmap tracker in an inconsistent state will not prevent us from >> doing malloc during error reporting. We don't use mmap during error >> reporting, just malloc. Finally, moreover, NMT malloc code is comparatively >> simple and stable; mmap code is more complex, and we plan to rework parts of >> it in the near future. >> >> TL;DR I think just manually releasing the lock in error reporting has a high >> chance of succeeding. Especially if we disable the NMT report in the hs-err >> file I am not even sure this is necessary, though. We have safeguards >> against error reporting locking up (step timeouts and secondary crash >> guards). > > Thinking through this more, we can probably shorten my advice to this: > > When entering error handling, if NMT lock is held by crashing thread and NMT > level is detail, switch back to NMT level summary. > > That's it. > > Reasoning: we only do mallocs inside hs-err generation. For malloc, we only > lock inside NMT if the NMT level is detail. And this should never change: > malloc is too hot to justify a lock in summary mode. > > The only point where possible crashes or deadlocks can occur in error > reporting is when doing the NMT report itself - but chances for this to > happen are remote, and we have secondary crash handling and error step > timeouts to deal with this if it happens. Ok I've added some logic in `VMError::report` to switch to summary mode (I think this is the right place). However, if we need to do error reporting while the `NmtGuard` is already held, then I think we might still have reentrancy even in summary mode. After the latest commit that replaces `MemoryFileTracker`'s locker with `NmtGuard`, `MemBaseline::baseline_summary()` now acquires `NmtGuard` too. Because of this I've also made `NmtGuard` reentrant. Maybe handling reentrancy would be beneficial in future/unforeseen situations as well. But if this is unnecessary, I don't mind removing it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20852#discussion_r1748744931