On Fri, 6 Sep 2024 04:27:44 GMT, David Holmes <dhol...@openjdk.org> wrote:

>> ### Summary
>> This PR just replaces `ThreadCritical` with a lock specific to NMT.  
>> `ThreadCritical` is a big lock and is unnecessary for the purposes of NMT. 
>> I've implemented the new lock with a semaphore so that it can be used early 
>> before VM init.  There is also the possibility of adding assertions in 
>> places we expect NMT to have synchronization. I haven't added assertions yet 
>> in many places because some code paths such as the (NMT tests)  don't lock 
>> yet. I think it makes sense to close any gaps in locking in another PR in 
>> which I can also add more assertions. 
>> 
>> Testing:
>> - hotspot_nmt
>> - gtest:VirtualSpace
>> - tier1
>
> src/hotspot/share/nmt/nmtCommon.hpp line 153:
> 
>> 151:       intx const current =  os::current_thread_id();
>> 152:       intx const owner = Atomic::load(&_owner);
>> 153:       assert(current != owner, "Lock is not reentrant");
> 
> ThreadCritical is reentrant though. We would need to do a lot of testing 
> and/or code analysis to ensure we don't have the possibility of needing a 
> reentrant lock with NMT. The most likely problems would be if we triggered a 
> crash whilst in a locked section of code.

We have a healthy amount of tests already. We need a wide use of assert_locked 
and need to take a close look at error reporting. Locking, in NMT, is rather 
simple.

Error handling: My pragmatic "good enough for this rare scenario" proposal 
would be: 

At the entry of error reporting, if the NMT mutex is locked by the crashing 
thread, just unlock it manually and try to continue. The chance is high that 
this will actually work well enough to survive the few mallocs we may do during 
error reporting. 

Reasoning: We use NMT mutex as we did ThreadCritical: rather widely scoped. The 
time windows where things are out of sync and reentrant use of the same 
functionality can bite us are small. Moreover, NMT's two subsystems - malloc 
tracker and mmap tracker - are mostly independent from each other, and even 
leaving the mmap tracker in an inconsistent state will not prevent us from 
doing malloc during error reporting. We don't use mmap during error reporting, 
just malloc. Finally, moreover, NMT malloc code is comparatively simple and 
stable; mmap code is more complex, and we plan to rework parts of it in the 
near future.

TL;DR I think just manually releasing the lock in error reporting has a high 
chance of succeeding. Especially if we disable the NMT report in the hs-err 
file I am not even sure this is necessary, though. We have safeguards against 
error reporting locking up (step timeouts and secondary crash guards).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20852#discussion_r1746714270

Reply via email to