On Mon, 24 Mar 2025 16:16:34 GMT, Stefan Karlsson <[email protected]> wrote:
>> ### Summary:
>> This PR makes memory operations atomic with NMT accounting.
>>
>> ### The problem:
>> In memory related functions like `os::commit_memory` and
>> `os::reserve_memory` the OS memory operations are currently done before
>> acquiring the the NMT mutex. And the the virtual memory accounting is done
>> later in `MemTracker`, after the lock has been acquired. Doing the memory
>> operations outside of the lock scope can lead to races.
>>
>> 1.1 Thread_1 releases range_A.
>> 1.2 Thread_1 tells NMT "range_A has been released".
>>
>> 2.1 Thread_2 reserves (the now free) range_A.
>> 2.2 Thread_2 tells NMT "range_A is reserved".
>>
>> Since the sequence (1.1) (1.2) is not atomic, if Thread_2 begins operating
>> after (1.1), we can have (1.1) (2.1) (2.2) (1.2). The OS sees two valid
>> subsequent calls (release range_A, followed by map range_A). But NMT sees
>> "reserve range_A", "release range_A" and is now out of sync with the OS.
>>
>> ### Solution:
>> Where memory operations such as reserve, commit, or release virtual memory
>> happen, I've expanded the scope of `NmtVirtualMemoryLocker` to protect both
>> the NMT accounting and the memory operation itself.
>>
>> ### Other notes:
>> I also simplified this pattern found in many places:
>>
>> if (MemTracker::enabled()) {
>> MemTracker::NmtVirtualMemoryLocker nvml;
>> result = pd_some_operation(addr, bytes);
>> if (result != nullptr) {
>> MemTracker::record_some_operation(addr, bytes);
>> }
>> } else {
>> result = pd_unmap_memory(addr, bytes);
>> }
>> ```
>> To:
>>
>> MemTracker::NmtVirtualMemoryLocker nvml;
>> result = pd_unmap_memory(addr, bytes);
>> MemTracker::record_some_operation(addr, bytes);
>> ```
>> This is possible because `NmtVirtualMemoryLocker` now checks
>> `MemTracker::enabled()`. `MemTracker::record_some_operation` already checks
>> `MemTracker::enabled()` and checks against nullptr. This refactoring
>> previously wasn't possible because `ThreadCritical` was used before
>> https://github.com/openjdk/jdk/pull/22745 introduced
>> `NmtVirtualMemoryLocker`.
>>
>> I considered moving the locking and NMT accounting down into platform
>> specific code: Ex. lock around { munmap() + MemTracker::record }. The hope
>> was that this would help reduce the size of the critical section. However, I
>> found that the OS-specific "pd_" functions are already short and
>> to-the-point, so doing this wasn't reducing the lock scope very much.
>> Instead it just makes the code more messy by having to maintain the locking
>> and NMT accounting in each platform specific i...
>
> I don't see why we need to extend the lock to be held over the
> reserve/commit/alloc part.
>
> If we only extend the lock on the release side, then it looks like we get the
> required exclusion:
>
> lock
> 1.1 Thread_1 releases range_A.
> 1.2 Thread_1 tells NMT "range_A has been released".
> unlock
>
> 2.1 Thread_2 reserves (the now free) range_A.
> lock
> 2.2 Thread_2 tells NMT "range_A is reserved".
> unlock
>
> We can get ordering (1.1) (2.1) (1.2) (2.2) but we can't get (1.1) (2.1)
> (2.2) (1.2).
>
> And isn't this locking scheme exactly what the current code is using? Have
> you seen an issue that this proposed PR intends to solve? If there is such a
> problem I wonder if there's just a missing lock extension in one of the
> "release" operations.
Hi @stefank, I think you're right about (1.1) (2.1) (2.2) (1.2) being prevented
by the current implementation. Is there a reason that the current
implementation only does the wider locking for release/uncommit? Maybe (2.1)
(1.1) (1.2) (2.2) isn't much of an issue since it's unlikely that another
thread would uncommit/release the same base address shortly after it's
committed/reserved?
>Have you seen an issue that this proposed PR intends to solve? If there is
>such a problem I wonder if there's just a missing lock extension in one of the
>"release" operations.
I haven't seen that race in the wild, I just noticed that the memory operations
weren't protected and thought that it could be a problem.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/24084#issuecomment-2748848175