On Mon, 24 Mar 2025 16:16:34 GMT, Stefan Karlsson <stef...@openjdk.org> wrote:
>> ### Summary: >> This PR makes memory operations atomic with NMT accounting. >> >> ### The problem: >> In memory related functions like `os::commit_memory` and >> `os::reserve_memory` the OS memory operations are currently done before >> acquiring the the NMT mutex. And the the virtual memory accounting is done >> later in `MemTracker`, after the lock has been acquired. Doing the memory >> operations outside of the lock scope can lead to races. >> >> 1.1 Thread_1 releases range_A. >> 1.2 Thread_1 tells NMT "range_A has been released". >> >> 2.1 Thread_2 reserves (the now free) range_A. >> 2.2 Thread_2 tells NMT "range_A is reserved". >> >> Since the sequence (1.1) (1.2) is not atomic, if Thread_2 begins operating >> after (1.1), we can have (1.1) (2.1) (2.2) (1.2). The OS sees two valid >> subsequent calls (release range_A, followed by map range_A). But NMT sees >> "reserve range_A", "release range_A" and is now out of sync with the OS. >> >> ### Solution: >> Where memory operations such as reserve, commit, or release virtual memory >> happen, I've expanded the scope of `NmtVirtualMemoryLocker` to protect both >> the NMT accounting and the memory operation itself. >> >> ### Other notes: >> I also simplified this pattern found in many places: >> >> if (MemTracker::enabled()) { >> MemTracker::NmtVirtualMemoryLocker nvml; >> result = pd_some_operation(addr, bytes); >> if (result != nullptr) { >> MemTracker::record_some_operation(addr, bytes); >> } >> } else { >> result = pd_unmap_memory(addr, bytes); >> } >> ``` >> To: >> >> MemTracker::NmtVirtualMemoryLocker nvml; >> result = pd_unmap_memory(addr, bytes); >> MemTracker::record_some_operation(addr, bytes); >> ``` >> This is possible because `NmtVirtualMemoryLocker` now checks >> `MemTracker::enabled()`. `MemTracker::record_some_operation` already checks >> `MemTracker::enabled()` and checks against nullptr. This refactoring >> previously wasn't possible because `ThreadCritical` was used before >> https://github.com/openjdk/jdk/pull/22745 introduced >> `NmtVirtualMemoryLocker`. >> >> I considered moving the locking and NMT accounting down into platform >> specific code: Ex. lock around { munmap() + MemTracker::record }. The hope >> was that this would help reduce the size of the critical section. However, I >> found that the OS-specific "pd_" functions are already short and >> to-the-point, so doing this wasn't reducing the lock scope very much. >> Instead it just makes the code more messy by having to maintain the locking >> and NMT accounting in each platform specific i... > > I don't see why we need to extend the lock to be held over the > reserve/commit/alloc part. > > If we only extend the lock on the release side, then it looks like we get the > required exclusion: > > lock > 1.1 Thread_1 releases range_A. > 1.2 Thread_1 tells NMT "range_A has been released". > unlock > > 2.1 Thread_2 reserves (the now free) range_A. > lock > 2.2 Thread_2 tells NMT "range_A is reserved". > unlock > > We can get ordering (1.1) (2.1) (1.2) (2.2) but we can't get (1.1) (2.1) > (2.2) (1.2). > > And isn't this locking scheme exactly what the current code is using? Have > you seen an issue that this proposed PR intends to solve? If there is such a > problem I wonder if there's just a missing lock extension in one of the > "release" operations. Hi @stefank, I think you're right about (1.1) (2.1) (2.2) (1.2) being prevented by the current implementation. Is there a reason that the current implementation only does the wider locking for release/uncommit? Maybe (2.1) (1.1) (1.2) (2.2) isn't much of an issue since it's unlikely that another thread would uncommit/release the same base address shortly after it's committed/reserved? >Have you seen an issue that this proposed PR intends to solve? If there is >such a problem I wonder if there's just a missing lock extension in one of the >"release" operations. I haven't seen that race in the wild, I just noticed that the memory operations weren't protected and thought that it could be a problem. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24084#issuecomment-2748848175