On Mon, 3 Apr 2023 11:05:41 GMT, Roman Kennke <rken...@openjdk.org> wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the 
>> overloading of the mark word. That overloading causes massive problems with 
>> Lilliput, because it means we have to check for and deal with this situation 
>> whenever we access the mark word. And because of the very racy nature of 
>> that code, this turns out to be very complex and would involve a variant of 
>> the inflation protocol to ensure that the object header is stable. (The 
>> current implementation of setting/fetching the i-hash provides a glimpse 
>> into the complexity.)
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
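>> 
>> A rough, self-contained sketch of that fast path (modelling the header word 
>> with std::atomic; names and bit values are illustrative, not the actual 
>> HotSpot code):
>> 
>> ```c++
>> #include <atomic>
>> #include <cstdint>
>> 
>> struct BasicLock { uintptr_t displaced_header; }; // lives in the caller's frame
>> 
>> bool stack_lock(std::atomic<uintptr_t>& header, BasicLock* lock) {
>>   uintptr_t mark = header.load(std::memory_order_relaxed);
>>   if ((mark & 0x3) == 0x1) {                      // 01 == unlocked ("neutral") header
>>     lock->displaced_header = mark;                // save the original header on the stack
>>     // CAS a pointer to the stack slot into the header; low bits 00 == stack-locked
>>     return header.compare_exchange_strong(mark, reinterpret_cast<uintptr_t>(lock));
>>   }
>>   return false;                                   // recursive or contended: slow path
>> }
>> ```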
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typically remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> answered very quickly by a scan of the array. More complex queries like 
>> 'which thread owns X?' are only performed in paths that are not 
>> performance-critical (usually in code like JVMTI or deadlock detection), 
>> where it is ok to do more complex operations (and we already do). The 
>> lock-stack is also a new set of GC roots and would be scanned during thread 
>> scanning, possibly concurrently, via the normal protocols.
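>> 
>> A minimal model of that lock-stack and the owner query (illustrative only, 
>> not the PR's code):
>> 
>> ```c++
>> struct Oop;                                       // opaque object reference
>> 
>> struct LockStack {
>>   static const int CAPACITY = 8;                  // fixed size, see below
>>   Oop* _base[CAPACITY];
>>   int  _top = 0;
>> 
>>   // 'does the current thread own this object?' == a linear scan of a tiny array
>>   bool contains(Oop* o) const {
>>     for (int i = 0; i < _top; i++) {
>>       if (_base[i] == o) return true;
>>     }
>>     return false;
>>   }
>> };
>> ```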
>> 
>> The lock-stack is fixed size, currently with 8 elements. According to my 
>> experiments with various workloads, this covers the vast majority of 
>> workloads (in fact, most workloads seem to never exceed 5 active locks per 
>> thread at a time). We check for overflow in the fast paths, and when the 
>> lock-stack is full, we take the slow path, which inflates the lock to a 
>> monitor. That case should be very rare.
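>> 
>> Continuing the model above, the fast-locking monitorenter path with the 
>> overflow check might look roughly like this (slow path elided, names 
>> hypothetical):
>> 
>> ```c++
>> bool fast_lock(std::atomic<uintptr_t>& header, LockStack& ls, Oop* obj) {
>>   if (ls._top == LockStack::CAPACITY) {
>>     return false;                                 // lock-stack full: inflate to a monitor
>>   }
>>   uintptr_t mark = header.load(std::memory_order_relaxed);
>>   if ((mark & 0x3) == 0x1) {                      // unlocked header
>>     // CAS only the low two bits to 00; the upper bits keep hash, age, etc.
>>     if (header.compare_exchange_strong(mark, mark & ~uintptr_t(0x3))) {
>>       ls._base[ls._top++] = obj;                  // record ownership thread-locally
>>       return true;
>>     }
>>   }
>>   return false;                                   // contended or recursive: slow path
>> }
>> ```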
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When recursive locking is attempted, the fast-lock gets 
>> inflated to a full monitor. It is not clear whether it is worth adding 
>> support for recursive fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is record a special marker, ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit and observes 
>> ANONYMOUS_OWNER, it knows the owner must be itself, fixes the owner field to 
>> point to itself, and then properly exits the monitor, thus handing the lock 
>> over to the contending thread.
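>> 
>> In the same illustrative model, the handover could look like this (the 
>> sentinel value and function names are assumptions, not the PR's code):
>> 
>> ```c++
>> struct Thread;
>> static Thread* const ANONYMOUS_OWNER = reinterpret_cast<Thread*>(1);
>> 
>> struct ObjectMonitor {
>>   std::atomic<Thread*> owner{nullptr};
>> };
>> 
>> // Contending thread: inflates a fast-locked object but cannot name the owner.
>> void inflate_fast_locked(ObjectMonitor& m) {
>>   m.owner.store(ANONYMOUS_OWNER);                 // "somebody owns this, we don't know who"
>>   // ... then block/spin until the owner exits ...
>> }
>> 
>> // Owning thread, at monitorexit: seeing the sentinel means the owner is itself.
>> void exit_monitor(ObjectMonitor& m, Thread* self) {
>>   if (m.owner.load() == ANONYMOUS_OWNER) {
>>     m.owner.store(self);                          // fix the owner field
>>   }
>>   // ... proper monitor exit, waking the contending thread ...
>>   m.owner.store(nullptr);
>> }
>> ```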
>> 
>> As an alternative, I considered removing stack-locking altogether and only 
>> using heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. In other 
>> words, uncontended use of Vector, StringBuffer, etc. is fine as such, but 
>> creating lots of such single-use, single-threaded-locked objects leads to 
>> massive ObjectMonitor churn, which can have a significant performance 
>> impact. But alas, such code exists, and we probably don't want to punish it 
>> if we can avoid it.
>> 
>> This change enables us to simplify (and speed up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer into the object header (see the sketch after 
>> this list).
>> - Accessing the hashcode could now always be done in the fast path, 
>> provided the hashcode has been installed. Fast-locked headers can be used 
>> directly; for monitor-locked objects we can easily reach through to the 
>> displaced header. This is safe because Java threads participate in the 
>> monitor deflation protocol. This would be implemented in a separate PR.
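>> 
>> For the first bullet, in the illustrative model above, installing a monitor 
>> could become a single CAS of the tagged pointer (tag value assumed; the PR's 
>> actual code differs):
>> 
>> ```c++
>> bool install_monitor(std::atomic<uintptr_t>& header, ObjectMonitor* m) {
>>   uintptr_t expected = header.load(std::memory_order_relaxed);
>>   uintptr_t tagged = reinterpret_cast<uintptr_t>(m) | 0x2;  // low bits 10 == monitor-locked
>>   return header.compare_exchange_strong(expected, tagged);
>> }
>> ```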
>> 
>> Also, and I might be mistaken here, this new lightweight locking would make 
>> synchronized work better with Loom: Because the lock-records are no longer 
>> scattered across the stack, but instead are densely packed into the 
>> lock-stack, it should be easy for a vthread to save its lock-stack upon 
>> unmounting and restore it when re-mounting. However, I am not sure about 
>> this, and this PR does not attempt to implement that support.
>> 
>> Testing:
>>  - [x] tier1 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier2 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier3 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier4 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier1 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier2 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier3 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier4 x86_64 x aarch64 x -UseFastLocking
>>  - [x] Several real-world applications have been tested with this change in 
>> tandem with Lilliput; no problems have been observed so far.
>> 
>> ### Performance
>> 
>> #### Simple Microbenchmark
>> 
>> The microbenchmark exercises only the locking primitives for monitorenter 
>> and monitorexit, without contention. The benchmark can be found 
>> [here](https://github.com/rkennke/fastlockbench). Numbers are in ns/op.
>> 
>> |  | x86_64 | aarch64 |
>> | -- | -- | -- |
>> | -UseFastLocking | 20.651 | 20.764 |
>> | +UseFastLocking | 18.896 | 18.908 |
>> 
>> 
>> #### Renaissance
>> 
>> | Benchmark | x86_64 stack-locking | x86_64 fast-locking | diff | aarch64 stack-locking | aarch64 fast-locking | diff |
>> | -- | -- | -- | -- | -- | -- | -- |
>> | AkkaUct | 841.884 | 836.948 | 0.59% | 1475.774 | 1465.647 | 0.69% |
>> | Reactors | 11041.427 | 11181.451 | -1.25% | 11381.751 | 11521.318 | -1.21% |
>> | Als | 1367.183 | 1359.358 | 0.58% | 1678.103 | 1688.067 | -0.59% |
>> | ChiSquare | 577.021 | 577.398 | -0.07% | 986.619 | 988.063 | -0.15% |
>> | GaussMix | 817.459 | 819.073 | -0.20% | 1154.293 | 1155.522 | -0.11% |
>> | LogRegression | 598.343 | 603.371 | -0.83% | 638.052 | 644.306 | -0.97% |
>> | MovieLens | 8248.116 | 8314.576 | -0.80% | 7569.219 | 7646.828 | -1.01% |
>> | NaiveBayes | 587.607 | 581.608 | 1.03% | 541.583 | 550.059 | -1.54% |
>> | PageRank | 3260.553 | 3263.472 | -0.09% | 4376.405 | 4381.101 | -0.11% |
>> | FjKmeans | 979.978 | 976.122 | 0.40% | 774.312 | 771.235 | 0.40% |
>> | FutureGenetic | 2187.369 | 2183.271 | 0.19% | 2685.722 | 2689.056 | -0.12% |
>> | ParMnemonics | 2434.551 | 2468.763 | -1.39% | 4278.225 | 4263.863 | 0.34% |
>> | Scrabble | 111.882 | 111.768 | 0.10% | 151.796 | 153.959 | -1.40% |
>> | RxScrabble | 210.252 | 211.38 | -0.53% | 310.116 | 315.594 | -1.74% |
>> | Dotty | 750.415 | 752.658 | -0.30% | 1033.636 | 1036.168 | -0.24% |
>> | ScalaDoku | 3072.05 | 3051.2 | 0.68% | 3711.506 | 3690.04 | 0.58% |
>> | ScalaKmeans | 211.427 | 209.957 | 0.70% | 264.38 | 265.788 | -0.53% |
>> | ScalaStmBench7 | 1017.795 | 1018.869 | -0.11% | 1088.182 | 1092.266 | -0.37% |
>> | Philosophers | 6450.124 | 6565.705 | -1.76% | 12017.964 | 11902.559 | 0.97% |
>> | FinagleChirper | 3953.623 | 3972.647 | -0.48% | 4750.751 | 4769.274 | -0.39% |
>> | FinagleHttp | 3970.526 | 4005.341 | -0.87% | 5294.125 | 5296.224 | -0.04% |
>
> Roman Kennke has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fix typo

Thanks for previous updates - much appreciated.

Another round of comments and some concerns.

One thing I can't quite get clear in my head is whether the small window where 
an object's monitor is inflated while the object is still in the thread's 
lock-stack could cause an issue for any external observers trying to determine 
the object's locked state.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9739:

> 9737:   get_thread(thread);
> 9738: #endif
> 9739:   subl(Address(thread, JavaThread::lock_stack_top_offset()), oopSize);

Is this code used for monitorexit or only when returning from synchronized 
methods? If it is used for monitorexit there is no requirement that the monitor 
being unlocked was the last monitor locked. Balanced locking only requires that 
the locks and unlocks are matched, not that they are perfectly nested, i.e. 
this is valid in bytecode:

monitorenter A
monitorenter B
monitorexit A
monitorexit B
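
To illustrate the concern: an unlock of a non-top entry would need the 
lock-stack removal to search and compact rather than just decrement the top, 
something like this (a sketch only, assuming an array `_base` with index 
`_top`, not the PR's actual code):

  void remove(oop obj) {
    for (int i = 0; i < _top; i++) {
      if (_base[i] == obj) {
        for (int j = i; j < _top - 1; j++) {
          _base[j] = _base[j + 1];   // compact the entries above the removed slot
        }
        _top--;
        return;
      }
    }
  }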

src/hotspot/share/runtime/javaThread.cpp line 1388:

> 1386:   }
> 1387: 
> 1388:   if (!UseHeavyMonitors && LockingMode == 2) {

Given that UseHeavyMonitors implies LockingMode == 0, it suffices to just check 
LockingMode == 2 here.

src/hotspot/share/runtime/lockStack.inline.hpp line 53:

> 51:   if (!thread->is_Java_thread()) {
> 52:     return false;
> 53:   }

I'm still unclear how we can have non-JavaThreads here. Only JavaThreads can 
lock object monitors.

src/hotspot/share/runtime/synchronizer.cpp line 1283:

> 1281:         inf->set_owner_from_anonymous(current);
> 1282:         assert(current->is_Java_thread(), "must be Java thread");
> 1283:         JavaThread::cast(current)->lock_stack().remove(object);

JavaThread::cast already asserts it is a JavaThread.

src/hotspot/share/runtime/synchronizer.cpp line 1314:

> 1312:     LogStreamHandle(Trace, monitorinflation) lsh;
> 1313:     if (mark.is_fast_locked()) {
> 1314:       assert(LockingMode == 2, "can only happen with new lightweight 
> locking");

I'd rather see this entire case guarded by `if (LockingMode == 2)` and have 
`is_fast_locked()` assert if called in any other mode. Similarly for the 
stack-locked case below.

src/hotspot/share/runtime/synchronizer.cpp line 1330:

> 1328:         // Success! Return inflated monitor.
> 1329:         if (own) {
> 1330:           assert(current->is_Java_thread(), "must be: checked in 
> is_lock_owned()");

Again this assert is not needed as `JavaThread::cast` will perform it.

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/JavaVFrame.java 
line 85:

> 83:         ( // we have marked ourself as pending on this monitor
> 84:           mark.monitor().equals(thread.getCurrentPendingMonitor()) ||
> 85:           mark.monitor().isOwnedAnonymous() ||

Not at all clear to me how this fits here. ??

-------------

PR Review: https://git.openjdk.org/jdk/pull/10907#pullrequestreview-1370203589
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1156763358
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1156767200
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1156736240
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1156785509
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1156787971
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1156789340
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1156797838
