On Fri, 6 Sep 2024 10:35:39 GMT, Thomas Stuefe <stu...@openjdk.org> wrote:

>> We have a healthy amount of tests already. We need a wide use of 
>> assert_locked and need to take a close look at error reporting. Locking, in 
>> NMT, is rather simple.
>> 
>> Error handling: My pragmatic "good enough for this rare scenario" proposal 
>> would be: 
>> 
>> At the entry of error reporting, if the NMT mutex is locked by the crashing 
>> thread, just unlock it manually and try to continue. The chance is high that 
>> this will actually work well enough to survive the few mallocs we may do 
>> during error reporting. 
>> 
>> Reasoning: We use NMT mutex as we did ThreadCritical: rather widely scoped. 
>> The time windows where things are out of sync and reentrant use of the same 
>> functionality can bite us are small. Moreover, NMT's two subsystems - malloc 
>> tracker and mmap tracker - are mostly independent from each other, and even 
>> leaving the mmap tracker in an inconsistent state will not prevent us from 
>> doing malloc during error reporting. We don't use mmap during error 
>> reporting, just malloc. Finally, moreover, NMT malloc code is comparatively 
>> simple and stable; mmap code is more complex, and we plan to rework parts of 
>> it in the near future.
>> 
>> TL;DR I think just manually releasing the lock in error reporting has a high 
>> chance of succeeding. Especially if we disable the NMT report in the hs-err 
>> file I am not even sure this is necessary, though. We have safeguards 
>> against error reporting locking up (step timeouts and secondary crash 
>> guards).
>
> Thinking through this more, we can probably shorten my advice to this:
> 
> When entering error handling, if NMT lock is held by crashing thread and NMT 
> level is detail, switch back to NMT level summary.
> 
> That's it.
> 
> Reasoning: we only do mallocs inside hs-err generation. For malloc, we only 
> lock inside NMT if the NMT level is detail. And this should never change: 
> malloc is too hot to justify a lock in summary mode.
> 
> The only point where possible crashes or deadlocks can occur in error 
> reporting is when doing the NMT report itself - but chances for this to 
> happen are remote, and we have secondary crash handling and error step 
> timeouts to deal with this if it happens.

Ok I've added some logic in `VMError::report` to switch to summary mode (I 
think this is the right place). However, if we need to do error reporting while 
the `NmtGuard` is already held, then I think we might still have reentrancy 
even in summary mode. After the latest commit that replaces 
`MemoryFileTracker`'s locker with `NmtGuard`, `MemBaseline::baseline_summary()` 
now acquires `NmtGuard` too.   

Because of this I've also made `NmtGuard` reentrant. Maybe handling reentrancy 
would be beneficial in future/unforeseen situations as well. But if this is 
unnecessary, I don't mind removing it.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20852#discussion_r1748744931

Reply via email to