On Thu, 20 Apr 2023 11:15:47 GMT, Roman Kennke <rken...@openjdk.org> wrote:
>> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is fixed size, currently with 8 elements. According to my >> experiments with various workloads, this covers the vast majority of >> workloads (in-fact, most workloads seem to never exceed 5 active locks per >> thread at a time). We check for overflow in the fast-paths and when the >> lock-stack is full, we take the slow-path, which would inflate the lock to a >> monitor. That case should be very rare. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation protocol. >> This would be implemented in a separate PR >> >> Also, and I might be mistaken here, this new lightweight locking would make >> synchronized work better with Loom: Because the lock-records are no longer >> scattered across the stack, but instead are densely packed into the >> lock-stack, it should be easy for a vthread to save its lock-stack upon >> unmounting and restore it when re-mounting. However, I am not sure about >> this, and this PR does not attempt to implement that support. >> >> Testing: >> - [x] tier1 x86_64 x aarch64 x +UseFastLocking >> - [x] tier2 x86_64 x aarch64 x +UseFastLocking >> - [x] tier3 x86_64 x aarch64 x +UseFastLocking >> - [x] tier4 x86_64 x aarch64 x +UseFastLocking >> - [x] tier1 x86_64 x aarch64 x -UseFastLocking >> - [x] tier2 x86_64 x aarch64 x -UseFastLocking >> - [x] tier3 x86_64 x aarch64 x -UseFastLocking >> - [x] tier4 x86_64 x aarch64 x -UseFastLocking >> - [x] Several real-world applications have been tested with this change in >> tandem with Lilliput without any problems, yet >> >> ### Performance >> >> #### Simple Microbenchmark >> >> The microbenchmark exercises only the locking primitives for monitorenter >> and monitorexit, without contention. The benchmark can be found >> (here)[https://github.com/rkennke/fastlockbench]. Numbers are in ns/ops. >> >> | | x86_64 | aarch64 | >> | -- | -- | -- | >> | -UseFastLocking | 20.651 | 20.764 | >> | +UseFastLocking | 18.896 | 18.908 | >> >> >> #### Renaissance >> >> | x86_64 | | | | aarch64 | | >> -- | -- | -- | -- | -- | -- | -- | -- >> | stack-locking | fast-locking | | | stack-locking | fast-locking | >> AkkaUct | 841.884 | 836.948 | 0.59% | | 1475.774 | 1465.647 | 0.69% >> Reactors | 11041.427 | 11181.451 | -1.25% | | 11381.751 | 11521.318 | >> -1.21% >> Als | 1367.183 | 1359.358 | 0.58% | | 1678.103 | 1688.067 | -0.59% >> ChiSquare | 577.021 | 577.398 | -0.07% | | 986.619 | 988.063 | -0.15% >> GaussMix | 817.459 | 819.073 | -0.20% | | 1154.293 | 1155.522 | -0.11% >> LogRegression | 598.343 | 603.371 | -0.83% | | 638.052 | 644.306 | -0.97% >> MovieLens | 8248.116 | 8314.576 | -0.80% | | 7569.219 | 7646.828 | -1.01%% >> NaiveBayes | 587.607 | 581.608 | 1.03% | | 541.583 | 550.059 | -1.54% >> PageRank | 3260.553 | 3263.472 | -0.09% | | 4376.405 | 4381.101 | -0.11% >> FjKmeans | 979.978 | 976.122 | 0.40% | | 774.312 | 771.235 | 0.40% >> FutureGenetic | 2187.369 | 2183.271 | 0.19% | | 2685.722 | 2689.056 | >> -0.12% >> ParMnemonics | 2434.551 | 2468.763 | -1.39% | | 4278.225 | 4263.863 | 0.34% >> Scrabble | 111.882 | 111.768 | 0.10% | | 151.796 | 153.959 | -1.40% >> RxScrabble | 210.252 | 211.38 | -0.53% | | 310.116 | 315.594 | -1.74% >> Dotty | 750.415 | 752.658 | -0.30% | | 1033.636 | 1036.168 | -0.24% >> ScalaDoku | 3072.05 | 3051.2 | 0.68% | | 3711.506 | 3690.04 | 0.58% >> ScalaKmeans | 211.427 | 209.957 | 0.70% | | 264.38 | 265.788 | -0.53% >> ScalaStmBench7 | 1017.795 | 1018.869 | -0.11% | | 1088.182 | 1092.266 | >> -0.37% >> Philosophers | 6450.124 | 6565.705 | -1.76% | | 12017.964 | 11902.559 | >> 0.97% >> FinagleChirper | 3953.623 | 3972.647 | -0.48% | | 4750.751 | 4769.274 | >> -0.39% >> FinagleHttp | 3970.526 | 4005.341 | -0.87% | | 5294.125 | 5296.224 | -0.04% > > Roman Kennke has updated the pull request incrementally with one additional > commit since the last revision: > > Remove unnecessary comments src/hotspot/cpu/aarch64/aarch64.ad line 3875: > 3873: __ b(cont); > 3874: } else { > 3875: assert(LockingMode == LM_LIGHTWEIGHT, ""); Perhaps should be: s/""/"must be"/ I'm not fond of empty assert mesgs. src/hotspot/cpu/aarch64/aarch64.ad line 3956: > 3954: __ b(cont); > 3955: } else { > 3956: assert(LockingMode == LM_LIGHTWEIGHT, ""); Perhaps should be: s/""/"must be"/ I'm not fond of empty assert mesgs. src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1813: > 1811: __ br(Assembler::NE, slow_path_lock); > 1812: } else { > 1813: assert(LockingMode == LM_LIGHTWEIGHT, ""); Perhaps should be: s/""/"must be"/ I'm not fond of empty assert mesgs. src/hotspot/cpu/arm/c2_MacroAssembler_arm.cpp line 100: > 98: fast_lock_2(Roop /* obj */, Rbox /* t1 */, Rscratch /* t2 */, > Rscratch2 /* t3 */, > 99: 1 /* savemask (save t1) */, > 100: done); Why not line up the '1' below the 'R' in Roop and join with the 'done);' line? src/hotspot/cpu/arm/c2_MacroAssembler_arm.cpp line 152: > 150: fast_unlock_2(Roop /* obj */, Rbox /* t1 */, Rscratch /* t2 */, > Rscratch2 /* t3 */, > 151: 1 /* savemask (save t1) */, > 152: done); Why not line up the '1' below the 'R' in Roop and join with the 'done);' line? src/hotspot/cpu/arm/interp_masm_arm.cpp line 900: > 898: b(done); > 899: > 900: } else if (LockingMode == LM_LEGACY) { Why so many blank lines in this new block? src/hotspot/cpu/arm/interp_masm_arm.cpp line 1025: > 1023: fast_unlock_2(Robj /* obj */, Rlock /* t1 */, Rmark /* t2 */, > Rtemp /* t3 */, > 1024: 1 /* savemask (save t1) */, > 1025: slow_case); Why not line up the '1' below the 'R' in Roop and join with the 'done);' line? src/hotspot/cpu/arm/macroAssembler_arm.cpp line 1803: > 1801: #ifdef ASSERT > 1802: // Poison scratch regs > 1803: POISON_REGS((~savemask), t1, t2, t3, 0x10000001); Should this poison value be: 0x20000002 src/hotspot/cpu/arm/macroAssembler_arm.cpp line 1811: > 1809: // Attempt to fast-unlock an object > 1810: // Registers: > 1811: // - obj: the object to be locked nit typo: s/locked/unlocked/ src/hotspot/cpu/arm/macroAssembler_arm.hpp line 1023: > 1021: // Attempt to fast-unlock an object > 1022: // Registers: > 1023: // - obj: the object to be locked nit typo: s/locked/unlocked/ src/hotspot/cpu/arm/sharedRuntime_arm.cpp line 649: > 647: > 648: __ flush(); > 649: return AdapterHandlerLibrary::new_entry(fingerprint, i2c_entry, > c2i_entry, c2i_unverified_entry); This change seems out of place... what's the story here? src/hotspot/cpu/arm/sharedRuntime_arm.cpp line 1246: > 1244: if (LockingMode == LM_LIGHTWEIGHT) { > 1245: log_trace(fastlock2)("SharedRuntime unlock fast"); > 1246: __ fast_unlock_2(sync_obj, R2, tmp, Rtemp, 7, slow_unlock); No comments on the params like in other places... src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9694: > 9692: > 9693: // Now we attempt to take the fast-lock. > 9694: // Clear lowest two header bits (locked state). Perhaps: // Clear lock_mask bits (locked state). so that you don't tie this comment to the implementation size of the lock_mask. src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9697: > 9695: andptr(hdr, ~(int32_t)markWord::lock_mask_in_place); > 9696: movptr(tmp, hdr); > 9697: // Set lowest bit (unlocked state). Perhaps: // Set unlocked_value bit. so that you don't tied this comment to the implementation of the unlocked_value being the lowest bit. I'm less worried about 'bit' versus 'bits' for this one. src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9721: > 9719: assert_different_registers(obj, hdr, tmp); > 9720: > 9721: // Mark-word must be 00 now, try to swing it back to 01 (unlocked) Perhaps: // Mark-word must be lock_mask now, try to swing it back to unlocked_value. so that you don't tie this comment to the implementation values of lock_mask and unlocked_value. src/hotspot/cpu/x86/sharedRuntime_x86_32.cpp line 1717: > 1715: __ jcc(Assembler::notEqual, slow_path_lock); > 1716: } else { > 1717: assert(LockingMode == LM_LIGHTWEIGHT, ""); Perhaps should be: s/""/"must be"/ I'm not fond of empty assert mesgs. src/hotspot/cpu/x86/sharedRuntime_x86_32.cpp line 1876: > 1874: __ dec_held_monitor_count(); > 1875: } else { > 1876: assert(LockingMode == LM_LIGHTWEIGHT, ""); Perhaps should be: s/""/"must be"/ I'm not fond of empty assert mesgs. src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 2187: > 2185: __ jcc(Assembler::notEqual, slow_path_lock); > 2186: } else { > 2187: assert(LockingMode == LM_LIGHTWEIGHT, ""); Perhaps should be: s/""/"must be"/ I'm not fond of empty assert mesgs. src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 2331: > 2329: __ dec_held_monitor_count(); > 2330: } else { > 2331: assert(LockingMode == LM_LIGHTWEIGHT, ""); Perhaps should be: s/""/"must be"/ I'm not fond of empty assert mesgs. src/hotspot/share/interpreter/interpreterRuntime.cpp line 759: > 757: // also keep the BasicObjectLock, but we don't really need it anyway, we > only need > 758: // the object. See also InterpreterMacroAssembler::lock_object(). > 759: // As soon as traditional stack-locking goes away we could remove the > other monitorenter() entry Perhaps: s/traditional/legacy/ for terminology consistency... src/hotspot/share/logging/logTag.hpp line 80: > 78: LOG_TAG(exceptions) \ > 79: LOG_TAG(exit) \ > 80: LOG_TAG(fastlock2) \ So why 'fastlock2'? Where's 'fastlock1'? Or 'fastlock'? src/hotspot/share/oops/markWord.hpp line 175: > 173: } > 174: bool has_locker() const { > 175: assert(LockingMode == LM_LEGACY, "should only be called with > traditional stack locking"); Perhaps: s/traditional/legacy/ for terminology consistency... src/hotspot/share/runtime/arguments.cpp line 1994: > 1992: if (UseHeavyMonitors) { > 1993: FLAG_SET_CMDLINE(LockingMode, LM_MONITOR); > 1994: } HotSpot option processing has a general rule of last setting wins. With L1992-1994 here, I think there might be a problem with a cmd line that specifies: -XX:+UseHeavyMonitors -XX:LockingMode=1 I think that the resulting value of `LockingMode` will be `LM_MONITOR` instead of `LM_LEGACY`. Granted mixing uses of `UseHeavyMonitors` with `LockingMode` is just asking for trouble... src/hotspot/share/runtime/globals.hpp line 1981: > 1979: "Select locking mode: " > \ > 1980: "0: monitors only, " > \ > 1981: "1: monitors & traditional stack-locking (default), " > \ Perhaps: s/traditional/legacy/ to be consistent with terminolgy... src/hotspot/share/runtime/javaThread.hpp line 1162: > 1160: > 1161: > 1162: static OopStorage* thread_oop_storage(); nit: delete extra blank line on L1161 src/hotspot/share/runtime/lockStack.cpp line 41: > 39: LockStack::LockStack(JavaThread* jt) : > 40: _top(lock_stack_base_offset), _base() > 41: { nit: '{' on L41 should be at the end of L40 (after a space). src/hotspot/share/runtime/lockStack.cpp line 63: > 61: #ifndef PRODUCT > 62: void LockStack::verify(const char* msg) const { > 63: assert(LockingMode == LM_LIGHTWEIGHT, "never use lock-stack when > fast-locking is disabled"); Perhaps: s/fast-locking/light weight locking/ src/hotspot/share/runtime/lockStack.cpp line 64: > 62: void LockStack::verify(const char* msg) const { > 63: assert(LockingMode == LM_LIGHTWEIGHT, "never use lock-stack when > fast-locking is disabled"); > 64: assert((_top <= end_offset()), "lockstack overflow: _top %d end_offset > %d", _top, end_offset()); nit: extra space after `<=` src/hotspot/share/runtime/lockStack.cpp line 65: > 63: assert(LockingMode == LM_LIGHTWEIGHT, "never use lock-stack when > fast-locking is disabled"); > 64: assert((_top <= end_offset()), "lockstack overflow: _top %d end_offset > %d", _top, end_offset()); > 65: assert((_top >= start_offset()), "lockstack underflow: _topt %d > end_offset %d", _top, start_offset()); nit typo: s/_topt/_top/ src/hotspot/share/runtime/lockStack.inline.hpp line 74: > 72: _base[to_index(_top)] = nullptr; > 73: #endif > 74: assert(!contains(o), "entries must be unique"); Perhaps: assert(!contains(o), "entries must be unique: " PTR_FORMAT, p2i(o)); src/hotspot/share/runtime/lockStack.inline.hpp line 81: > 79: inline void LockStack::remove(oop o) { > 80: verify("pre-remove"); > 81: assert(contains(o), "entry must be present"); Perhaps: assert(contains(o), "entry must be present: " PTR_FORMAT, p2i(o)); src/hotspot/share/runtime/synchronizer.cpp line 506: > 504: if (!useHeavyMonitors()) { > 505: if (LockingMode == LM_LIGHTWEIGHT) { > 506: // Fast-locking does not use the 'lock' argument.. nit: extra period at the end. src/hotspot/share/runtime/synchronizer.cpp line 1049: > 1047: > 1048: if (mark.has_monitor()) { > 1049: // inflated monitor so header points to ObjectMonitor (tagged > pointer). nit: s/inflated/Inflated/ src/hotspot/share/runtime/synchronizer.cpp line 1077: > 1075: > 1076: if (mark.has_monitor()) { > 1077: // inflated monitor so header points to ObjectMonitor (tagged > pointer). nit: s/inflated/Inflated/ src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/Threads.java line 213: > 211: // refer to Threads::owning_thread_from_monitor_owner > 212: public JavaThread owningThreadFromMonitor(Address o) { > 213: assert(VM.getVM().getCommandLineFlag("LockingMode").getInt() != > 2); Please put a comment after that literal '2': assert(VM.getVM().getCommandLineFlag("LockingMode").getInt() != 2 /* LM_LIGHTWEIGHT */); src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/Threads.java line 231: > 229: > 230: public JavaThread owningThreadFromMonitor(ObjectMonitor monitor) { > 231: if (VM.getVM().getCommandLineFlag("LockingMode").getInt() == 2) { Please put a comment after that literal '2': if (VM.getVM().getCommandLineFlag("LockingMode").getInt() == 2 /* LM_LIGHTWEIGHT */) { ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175779923 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175782883 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175838145 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175841805 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175841903 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175842779 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175843663 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175846203 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175845306 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175846631 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175847483 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175847916 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176939876 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176943266 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176946313 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176950471 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176951363 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176952831 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176953184 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176960883 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176963809 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176965777 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177061329 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177064110 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177066319 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177067974 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177069462 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177071139 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177071982 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177085020 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177084574 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178061523 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178072667 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178076085 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178100632 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178101434