On Thu, 20 Apr 2023 11:15:47 GMT, Roman Kennke <rken...@openjdk.org> wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typcially remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal 
 protocols.
>> 
>> The lock-stack is fixed size, currently with 8 elements. According to my 
>> experiments with various workloads, this covers the vast majority of 
>> workloads (in-fact, most workloads seem to never exceed 5 active locks per 
>> thread at a time). We check for overflow in the fast-paths and when the 
>> lock-stack is full, we take the slow-path, which would inflate the lock to a 
>> monitor. That case should be very rare.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear if it is worth to add support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, and thus handing over to the contending 
>> thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change enables to simplify (and speed-up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads participate in monitor deflation protocol. 
>> This would be implemented in a separate PR
>> 
>> Also, and I might be mistaken here, this new lightweight locking would make 
>> synchronized work better with Loom: Because the lock-records are no longer 
>> scattered across the stack, but instead are densely packed into the 
>> lock-stack, it should be easy for a vthread to save its lock-stack upon 
>> unmounting and restore it when re-mounting. However, I am not sure about 
>> this, and this PR does not attempt to implement that support.
>> 
>> Testing:
>>  - [x] tier1 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier2 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier3 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier4 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier1 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier2 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier3 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier4 x86_64 x aarch64 x -UseFastLocking
>>  - [x] Several real-world applications have been tested with this change in 
>> tandem with Lilliput without any problems, yet
>> 
>> ### Performance
>> 
>> #### Simple Microbenchmark
>> 
>> The microbenchmark exercises only the locking primitives for monitorenter 
>> and monitorexit, without contention. The benchmark can be found 
>> (here)[https://github.com/rkennke/fastlockbench]. Numbers are in ns/ops.
>> 
>> |  | x86_64 | aarch64 |
>> | -- | -- | -- |
>> | -UseFastLocking | 20.651 | 20.764 |
>> | +UseFastLocking | 18.896 | 18.908 |
>> 
>> 
>> #### Renaissance
>> 
>>   | x86_64 |   |   |   | aarch64 |   |  
>> -- | -- | -- | -- | -- | -- | -- | --
>>   | stack-locking | fast-locking |   |   | stack-locking | fast-locking |  
>> AkkaUct | 841.884 | 836.948 | 0.59% |   | 1475.774 | 1465.647 | 0.69%
>> Reactors | 11041.427 | 11181.451 | -1.25% |   | 11381.751 | 11521.318 | 
>> -1.21%
>> Als | 1367.183 | 1359.358 | 0.58% |   | 1678.103 | 1688.067 | -0.59%
>> ChiSquare | 577.021 | 577.398 | -0.07% |   | 986.619 | 988.063 | -0.15%
>> GaussMix | 817.459 | 819.073 | -0.20% |   | 1154.293 | 1155.522 | -0.11%
>> LogRegression | 598.343 | 603.371 | -0.83% |   | 638.052 | 644.306 | -0.97%
>> MovieLens | 8248.116 | 8314.576 | -0.80% |   | 7569.219 | 7646.828 | -1.01%%
>> NaiveBayes | 587.607 | 581.608 | 1.03% |   | 541.583 | 550.059 | -1.54%
>> PageRank | 3260.553 | 3263.472 | -0.09% |   | 4376.405 | 4381.101 | -0.11%
>> FjKmeans | 979.978 | 976.122 | 0.40% |   | 774.312 | 771.235 | 0.40%
>> FutureGenetic | 2187.369 | 2183.271 | 0.19% |   | 2685.722 | 2689.056 | 
>> -0.12%
>> ParMnemonics | 2434.551 | 2468.763 | -1.39% |   | 4278.225 | 4263.863 | 0.34%
>> Scrabble | 111.882 | 111.768 | 0.10% |   | 151.796 | 153.959 | -1.40%
>> RxScrabble | 210.252 | 211.38 | -0.53% |   | 310.116 | 315.594 | -1.74%
>> Dotty | 750.415 | 752.658 | -0.30% |   | 1033.636 | 1036.168 | -0.24%
>> ScalaDoku | 3072.05 | 3051.2 | 0.68% |   | 3711.506 | 3690.04 | 0.58%
>> ScalaKmeans | 211.427 | 209.957 | 0.70% |   | 264.38 | 265.788 | -0.53%
>> ScalaStmBench7 | 1017.795 | 1018.869 | -0.11% |   | 1088.182 | 1092.266 | 
>> -0.37%
>> Philosophers | 6450.124 | 6565.705 | -1.76% |   | 12017.964 | 11902.559 | 
>> 0.97%
>> FinagleChirper | 3953.623 | 3972.647 | -0.48% |   | 4750.751 | 4769.274 | 
>> -0.39%
>> FinagleHttp | 3970.526 | 4005.341 | -0.87% |   | 5294.125 | 5296.224 | -0.04%
>
> Roman Kennke has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Remove unnecessary comments

src/hotspot/cpu/aarch64/aarch64.ad line 3875:

> 3873:       __ b(cont);
> 3874:     } else {
> 3875:       assert(LockingMode == LM_LIGHTWEIGHT, "");

Perhaps should be: s/""/"must be"/
I'm not fond of empty assert mesgs.

src/hotspot/cpu/aarch64/aarch64.ad line 3956:

> 3954:       __ b(cont);
> 3955:     } else {
> 3956:       assert(LockingMode == LM_LIGHTWEIGHT, "");

Perhaps should be: s/""/"must be"/
I'm not fond of empty assert mesgs.

src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1813:

> 1811:       __ br(Assembler::NE, slow_path_lock);
> 1812:     } else {
> 1813:       assert(LockingMode == LM_LIGHTWEIGHT, "");

Perhaps should be: s/""/"must be"/
I'm not fond of empty assert mesgs.

src/hotspot/cpu/arm/c2_MacroAssembler_arm.cpp line 100:

> 98:     fast_lock_2(Roop /* obj */, Rbox /* t1 */, Rscratch /* t2 */, 
> Rscratch2 /* t3 */,
> 99:                   1 /* savemask (save t1) */,
> 100:                   done);

Why not line up the '1' below the 'R' in Roop and join with the 'done);' line?

src/hotspot/cpu/arm/c2_MacroAssembler_arm.cpp line 152:

> 150:     fast_unlock_2(Roop /* obj */, Rbox /* t1 */, Rscratch /* t2 */, 
> Rscratch2 /* t3 */,
> 151:                     1 /* savemask (save t1) */,
> 152:                     done);

Why not line up the '1' below the 'R' in Roop and join with the 'done);' line?

src/hotspot/cpu/arm/interp_masm_arm.cpp line 900:

> 898:       b(done);
> 899: 
> 900:     } else if (LockingMode == LM_LEGACY) {

Why so many blank lines in this new block?

src/hotspot/cpu/arm/interp_masm_arm.cpp line 1025:

> 1023:       fast_unlock_2(Robj /* obj */, Rlock /* t1 */, Rmark /* t2 */, 
> Rtemp /* t3 */,
> 1024:                       1 /* savemask (save t1) */,
> 1025:                       slow_case);

Why not line up the '1' below the 'R' in Roop and join with the 'done);' line?

src/hotspot/cpu/arm/macroAssembler_arm.cpp line 1803:

> 1801: #ifdef ASSERT
> 1802:   // Poison scratch regs
> 1803:   POISON_REGS((~savemask), t1, t2, t3, 0x10000001);

Should this poison value be: 0x20000002

src/hotspot/cpu/arm/macroAssembler_arm.cpp line 1811:

> 1809: // Attempt to fast-unlock an object
> 1810: // Registers:
> 1811: //  - obj: the object to be locked

nit typo: s/locked/unlocked/

src/hotspot/cpu/arm/macroAssembler_arm.hpp line 1023:

> 1021:   // Attempt to fast-unlock an object
> 1022:   // Registers:
> 1023:   //  - obj: the object to be locked

nit typo: s/locked/unlocked/

src/hotspot/cpu/arm/sharedRuntime_arm.cpp line 649:

> 647: 
> 648:   __ flush();
> 649:   return AdapterHandlerLibrary::new_entry(fingerprint, i2c_entry, 
> c2i_entry, c2i_unverified_entry);

This change seems out of place... what's the story here?

src/hotspot/cpu/arm/sharedRuntime_arm.cpp line 1246:

> 1244:     if (LockingMode == LM_LIGHTWEIGHT) {
> 1245:       log_trace(fastlock2)("SharedRuntime unlock fast");
> 1246:       __ fast_unlock_2(sync_obj, R2, tmp, Rtemp, 7, slow_unlock);

No comments on the params like in other places...

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9694:

> 9692: 
> 9693:   // Now we attempt to take the fast-lock.
> 9694:   // Clear lowest two header bits (locked state).

Perhaps:

  // Clear lock_mask bits (locked state).

so that you don't tie this comment to the implementation size of the lock_mask.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9697:

> 9695:   andptr(hdr, ~(int32_t)markWord::lock_mask_in_place);
> 9696:   movptr(tmp, hdr);
> 9697:   // Set lowest bit (unlocked state).

Perhaps:

 // Set unlocked_value bit.

so that you don't tied this comment to the implementation of the unlocked_value
being the lowest bit. I'm less worried about 'bit' versus 'bits' for this one.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9721:

> 9719:   assert_different_registers(obj, hdr, tmp);
> 9720: 
> 9721:   // Mark-word must be 00 now, try to swing it back to 01 (unlocked)

Perhaps:

  // Mark-word must be lock_mask now, try to swing it back to unlocked_value.

so that you don't tie this comment to the implementation values of
lock_mask and unlocked_value.

src/hotspot/cpu/x86/sharedRuntime_x86_32.cpp line 1717:

> 1715:       __ jcc(Assembler::notEqual, slow_path_lock);
> 1716:     } else {
> 1717:       assert(LockingMode == LM_LIGHTWEIGHT, "");

Perhaps should be: s/""/"must be"/
I'm not fond of empty assert mesgs.

src/hotspot/cpu/x86/sharedRuntime_x86_32.cpp line 1876:

> 1874:       __ dec_held_monitor_count();
> 1875:     } else {
> 1876:       assert(LockingMode == LM_LIGHTWEIGHT, "");

Perhaps should be: s/""/"must be"/
I'm not fond of empty assert mesgs.

src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 2187:

> 2185:       __ jcc(Assembler::notEqual, slow_path_lock);
> 2186:     } else {
> 2187:       assert(LockingMode == LM_LIGHTWEIGHT, "");

Perhaps should be: s/""/"must be"/
I'm not fond of empty assert mesgs.

src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 2331:

> 2329:       __ dec_held_monitor_count();
> 2330:     } else {
> 2331:       assert(LockingMode == LM_LIGHTWEIGHT, "");

Perhaps should be: s/""/"must be"/
I'm not fond of empty assert mesgs.

src/hotspot/share/interpreter/interpreterRuntime.cpp line 759:

> 757: // also keep the BasicObjectLock, but we don't really need it anyway, we 
> only need
> 758: // the object. See also InterpreterMacroAssembler::lock_object().
> 759: // As soon as traditional stack-locking goes away we could remove the 
> other monitorenter() entry

Perhaps:
s/traditional/legacy/
for terminology consistency...

src/hotspot/share/logging/logTag.hpp line 80:

> 78:   LOG_TAG(exceptions) \
> 79:   LOG_TAG(exit) \
> 80:   LOG_TAG(fastlock2) \

So why 'fastlock2'? Where's 'fastlock1'? Or 'fastlock'?

src/hotspot/share/oops/markWord.hpp line 175:

> 173:   }
> 174:   bool has_locker() const {
> 175:     assert(LockingMode == LM_LEGACY, "should only be called with 
> traditional stack locking");

Perhaps:
s/traditional/legacy/
for terminology consistency...

src/hotspot/share/runtime/arguments.cpp line 1994:

> 1992:   if (UseHeavyMonitors) {
> 1993:     FLAG_SET_CMDLINE(LockingMode, LM_MONITOR);
> 1994:   }

HotSpot option processing has a general rule of last setting wins.
With L1992-1994 here, I think there might be a problem with a cmd
line that specifies:

    -XX:+UseHeavyMonitors -XX:LockingMode=1

I think that the resulting value of `LockingMode` will be `LM_MONITOR`
instead of `LM_LEGACY`. Granted mixing uses of `UseHeavyMonitors`
with `LockingMode` is just asking for trouble...

src/hotspot/share/runtime/globals.hpp line 1981:

> 1979:           "Select locking mode: "                                       
>     \
> 1980:           "0: monitors only, "                                          
>     \
> 1981:           "1: monitors & traditional stack-locking (default), "         
>     \

Perhaps:
s/traditional/legacy/
to be consistent with terminolgy...

src/hotspot/share/runtime/javaThread.hpp line 1162:

> 1160: 
> 1161: 
> 1162:   static OopStorage* thread_oop_storage();

nit: delete extra blank line on L1161

src/hotspot/share/runtime/lockStack.cpp line 41:

> 39: LockStack::LockStack(JavaThread* jt) :
> 40:   _top(lock_stack_base_offset), _base()
> 41: {

nit: '{' on L41 should be at the end of L40 (after a space).

src/hotspot/share/runtime/lockStack.cpp line 63:

> 61: #ifndef PRODUCT
> 62: void LockStack::verify(const char* msg) const {
> 63:   assert(LockingMode == LM_LIGHTWEIGHT, "never use lock-stack when 
> fast-locking is disabled");

Perhaps:
s/fast-locking/light weight locking/

src/hotspot/share/runtime/lockStack.cpp line 64:

> 62: void LockStack::verify(const char* msg) const {
> 63:   assert(LockingMode == LM_LIGHTWEIGHT, "never use lock-stack when 
> fast-locking is disabled");
> 64:   assert((_top <=  end_offset()), "lockstack overflow: _top %d end_offset 
> %d", _top, end_offset());

nit: extra space after `<=`

src/hotspot/share/runtime/lockStack.cpp line 65:

> 63:   assert(LockingMode == LM_LIGHTWEIGHT, "never use lock-stack when 
> fast-locking is disabled");
> 64:   assert((_top <=  end_offset()), "lockstack overflow: _top %d end_offset 
> %d", _top, end_offset());
> 65:   assert((_top >= start_offset()), "lockstack underflow: _topt %d 
> end_offset %d", _top, start_offset());

nit typo: s/_topt/_top/

src/hotspot/share/runtime/lockStack.inline.hpp line 74:

> 72:   _base[to_index(_top)] = nullptr;
> 73: #endif
> 74:   assert(!contains(o), "entries must be unique");

Perhaps:

  assert(!contains(o), "entries must be unique: " PTR_FORMAT, p2i(o));

src/hotspot/share/runtime/lockStack.inline.hpp line 81:

> 79: inline void LockStack::remove(oop o) {
> 80:   verify("pre-remove");
> 81:   assert(contains(o), "entry must be present");

Perhaps:

  assert(contains(o), "entry must be present: " PTR_FORMAT, p2i(o));

src/hotspot/share/runtime/synchronizer.cpp line 506:

> 504:   if (!useHeavyMonitors()) {
> 505:     if (LockingMode == LM_LIGHTWEIGHT) {
> 506:       // Fast-locking does not use the 'lock' argument..

nit: extra period at the end.

src/hotspot/share/runtime/synchronizer.cpp line 1049:

> 1047: 
> 1048:   if (mark.has_monitor()) {
> 1049:     // inflated monitor so header points to ObjectMonitor (tagged 
> pointer).

nit: s/inflated/Inflated/

src/hotspot/share/runtime/synchronizer.cpp line 1077:

> 1075: 
> 1076:   if (mark.has_monitor()) {
> 1077:     // inflated monitor so header points to ObjectMonitor (tagged 
> pointer).

nit: s/inflated/Inflated/

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/Threads.java line 
213:

> 211:     // refer to Threads::owning_thread_from_monitor_owner
> 212:     public JavaThread owningThreadFromMonitor(Address o) {
> 213:         assert(VM.getVM().getCommandLineFlag("LockingMode").getInt() != 
> 2);

Please put a comment after that literal '2':

        assert(VM.getVM().getCommandLineFlag("LockingMode").getInt() != 2 /* 
LM_LIGHTWEIGHT */);

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/Threads.java line 
231:

> 229: 
> 230:     public JavaThread owningThreadFromMonitor(ObjectMonitor monitor) {
> 231:         if (VM.getVM().getCommandLineFlag("LockingMode").getInt() == 2) {

Please put a comment after that literal '2':

        if (VM.getVM().getCommandLineFlag("LockingMode").getInt() == 2 /* 
LM_LIGHTWEIGHT */) {

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175779923
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175782883
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175838145
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175841805
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175841903
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175842779
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175843663
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175846203
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175845306
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175846631
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175847483
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175847916
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176939876
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176943266
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176946313
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176950471
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176951363
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176952831
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176953184
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176960883
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176963809
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176965777
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177061329
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177064110
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177066319
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177067974
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177069462
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177071139
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177071982
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177085020
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177084574
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178061523
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178072667
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178076085
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178100632
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178101434

Reply via email to