On Fri, 28 Oct 2022 09:32:58 GMT, Roman Kennke <rken...@openjdk.org> wrote:

>> This change replaces the current stack-locking implementation with a 
>> fast-locking scheme that retains the advantages of stack-locking (namely 
>> fast locking in uncontended code-paths), while avoiding the overloading of 
>> the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check for and deal with the case where the 
>> header has been displaced onto a stack. Because of the very racy nature of 
>> that situation, this turns out to be very complex and involves a variant 
>> of the inflation protocol to ensure that the object header is stable. 
>> 
>> What the original stack-locking does is basically to push a stack-lock, 
>> which consists only of the displaced header, onto the stack, and CAS a 
>> pointer to this stack location into the object header (the lowest two header 
>> bits being 00 indicates 'stack-locked'). The pointer into the stack can then 
>> be used to identify which thread currently owns the lock.
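>> 
>> To make that concrete, here is a minimal, self-contained C++ sketch of the 
>> stack-locking idea. It only models the mechanism with a plain atomic word; 
>> the names (ObjectModel, BasicLock, try_stack_lock) are simplified 
>> assumptions for the sketch and not the actual HotSpot code.
>> 
>> ```c++
>> // Simplified model of stack-locking, not actual HotSpot code.
>> #include <atomic>
>> #include <cstdint>
>> 
>> struct ObjectModel {
>>   std::atomic<uintptr_t> mark;   // the object header (mark word)
>> };
>> 
>> struct BasicLock {               // lives in the locking thread's stack frame
>>   uintptr_t displaced_header;    // copy of the unlocked header ("displaced header")
>> };
>> 
>> // Lowest two header bits: 01 = unlocked, 00 = stack-locked.
>> bool try_stack_lock(ObjectModel* obj, BasicLock* lock) {
>>   uintptr_t unlocked = obj->mark.load() | 0x1;   // expect an unlocked header
>>   lock->displaced_header = unlocked;             // push the displaced header onto the stack
>>   // CAS a pointer to the on-stack BasicLock into the header; since stack slots
>>   // are word-aligned, the low bits become 00, which means 'stack-locked'.
>>   return obj->mark.compare_exchange_strong(
>>       unlocked, reinterpret_cast<uintptr_t>(lock));
>> }
>> ```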
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typically remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations. The lock-stack is 
>> also a new set of GC roots, and would be scanned during thread scanning, 
>> possibly concurrently, via the normal protocols.
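>> 
>> To illustrate, here is a rough, self-contained C++ sketch of the lock-stack 
>> idea and the fast-lock attempt. It is a simplified model, not HotSpot's 
>> actual lock-stack or locking code; names like LockStackModel and 
>> try_fast_lock are invented for the sketch, and 'oop' is modelled as a plain 
>> pointer.
>> 
>> ```c++
>> // Simplified model of fast-locking with a thread-local lock-stack; not HotSpot code.
>> #include <atomic>
>> #include <cstdint>
>> 
>> using oop = void*;
>> 
>> struct ObjHeader { std::atomic<uintptr_t> mark; };
>> 
>> struct LockStackModel {            // one instance per thread
>>   oop _base[8];                    // in practice this stays very small (3-5 entries)
>>   int _top = 0;
>> 
>>   void push(oop o) { _base[_top++] = o; }
>>   void pop()       { _top--; }     // common case: unlock in LIFO order
>>   // "does the current thread own me?" -- a quick linear scan
>>   bool contains(oop o) const {
>>     for (int i = 0; i < _top; i++) { if (_base[i] == o) return true; }
>>     return false;
>>   }
>> };
>> 
>> // Fast-lock attempt: CAS the lowest two header bits from 01 (unlocked) to 00
>> // (fast-locked), *without* storing a stack pointer in the upper bits;
>> // ownership is recorded by pushing the oop onto the thread's lock-stack.
>> bool try_fast_lock(ObjHeader* h, oop obj, LockStackModel& lock_stack) {
>>   uintptr_t unlocked = h->mark.load() | 0x1;        // expected: unlocked (...01)
>>   uintptr_t locked   = unlocked & ~(uintptr_t)0x3;  // target: fast-locked (...00)
>>   if (h->mark.compare_exchange_strong(unlocked, locked)) {
>>     lock_stack.push(obj);
>>     return true;
>>   }
>>   return false;                                     // locked or inflated: slow path
>> }
>> ```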
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When a thread locks an object that it has already fast-locked, 
>> the fast-lock gets inflated to a full monitor. It is not clear whether it is 
>> worth adding support for recursive fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is record a special marker, ANONYMOUS_OWNER, as the owner. When the 
>> thread that currently holds the lock arrives at monitorexit and observes 
>> ANONYMOUS_OWNER, it knows the owner must be itself, fixes the owner field to 
>> point to itself, and then properly exits the monitor, thus handing the lock 
>> over to the contending thread.
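>> 
>> A minimal C++ sketch of this hand-over, as a standalone model rather than 
>> HotSpot's ObjectMonitor; the MonitorModel type and the function names are 
>> invented for illustration.
>> 
>> ```c++
>> // Simplified model of the ANONYMOUS_OWNER hand-over; not HotSpot code.
>> #include <atomic>
>> 
>> struct Thread;
>> static Thread* const ANONYMOUS_OWNER = reinterpret_cast<Thread*>(0x1);
>> 
>> struct MonitorModel {
>>   std::atomic<Thread*> owner{nullptr};
>> };
>> 
>> // Contending thread: inflate a fast-locked object without knowing who holds it.
>> void inflate_from_fast_lock(MonitorModel* m) {
>>   m->owner.store(ANONYMOUS_OWNER);   // placeholder: "some thread owns this"
>>   // ... then block/spin, waiting for the real owner to exit the monitor
>> }
>> 
>> // Owning thread at monitorexit: it held the fast-lock, so ANONYMOUS_OWNER
>> // can only mean itself. Claim ownership properly, then exit the monitor,
>> // which hands the lock over to the contending thread.
>> void exit_monitor(MonitorModel* m, Thread* self) {
>>   if (m->owner.load() == ANONYMOUS_OWNER) {
>>     m->owner.store(self);            // fix up the owner to the current thread
>>   }
>>   // ... regular monitor exit: clear the owner, wake a contending thread
>>   m->owner.store(nullptr);
>> }
>> ```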
>> 
>> As an alternative, I considered removing stack-locking altogether and only 
>> using heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change makes it possible to simplify (and speed up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer into the object header (see the sketch after 
>> this list).
>> - Accessing the hashcode can now always be done in the fast path, provided 
>> the hashcode has been installed. Fast-locked headers can be used directly; 
>> for monitor-locked objects we can easily reach through to the displaced 
>> header. This is safe because Java threads participate in the monitor 
>> deflation protocol. This would be implemented in a separate PR.
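>> 
>> For the first point, a rough C++ sketch of what the simplified inflation 
>> could look like. This is only a model: the tag value (0b10 for 'has 
>> monitor') and the names ObjHeader, ObjectMonitorModel and install_monitor 
>> are assumptions of the sketch, not the actual HotSpot implementation.
>> 
>> ```c++
>> // Simplified model of direct monitor installation; not HotSpot code.
>> #include <atomic>
>> #include <cstdint>
>> 
>> struct ObjHeader { std::atomic<uintptr_t> mark; };
>> struct ObjectMonitorModel { uintptr_t displaced_mark; };  // saved header, incl. hashcode
>> 
>> bool install_monitor(ObjHeader* h, ObjectMonitorModel* mon) {
>>   uintptr_t old_mark = h->mark.load();
>>   mon->displaced_mark = old_mark;                            // keep e.g. the hashcode reachable
>>   uintptr_t tagged = reinterpret_cast<uintptr_t>(mon) | 0x2; // assumed tag 0b10 = 'has monitor'
>>   return h->mark.compare_exchange_strong(old_mark, tagged);  // single CAS, no multi-step protocol
>> }
>> ```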
>> 
>> ### Benchmarks
>> 
>> All benchmarks are run on server-class metal machines. The JVM settings are 
>> always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmark results are in 
>> ms/op; less is better.
>> 
>> #### DaCapo/AArch64
>> 
>> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an 
>> AWS m6g.metal instance). It uses the DaCapo evaluation version, git hash 
>> 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to 
>> exclude the cassandra, h2o & kafka benchmarks because of incompatibility 
>> with JDK20. Benchmarks that showed results far off the baseline or showed high 
>> variance have been repeated and I am reporting results with the most bias 
>> *against* fast-locking. The sunflow benchmark is really far off the mark - 
>> the baseline run with stack-locking exhibited very high run-to-run variance 
>> and generally much worse performance, while with fast-locking the variance 
>> was very low and the results very stable between runs. I wouldn't trust that 
>> benchmark - I mean, what is it actually doing such that a change in locking 
>> makes a >30% perf difference?
>> 
>> benchmark | baseline | fast-locking | % | size
>> -- | -- | -- | -- | --
>> avrora | 27859 | 27563 | 1.07% | large
>> batik | 20786 | 20847 | -0.29% | large
>> biojava | 27421 | 27334 | 0.32% | default
>> eclipse | 59918 | 60522 | -1.00% | large
>> fop | 3670 | 3678 | -0.22% | default
>> graphchi | 2088 | 2060 | 1.36% | default
>> h2 | 297391 | 291292 | 2.09% | huge
>> jme | 8762 | 8877 | -1.30% | default
>> jython | 18938 | 18878 | 0.32% | default
>> luindex | 1339 | 1325 | 1.06% | default
>> lusearch | 918 | 936 | -1.92% | default
>> pmd | 58291 | 58423 | -0.23% | large
>> sunflow | 32617 | 24961 | 30.67% | large
>> tomcat | 25481 | 25992 | -1.97% | large
>> tradebeans | 314640 | 311706 | 0.94% | huge
>> tradesoap | 107473 | 110246 | -2.52% | huge
>> xalan | 6047 | 5882 | 2.81% | default
>> zxing | 970 | 926 | 4.75% | default
>> 
>> #### DaCapo/x86_64
>> 
>> The following measurements have been taken on an Intel Xeon Scalable 
>> Processor (Cascade Lake 8252C) (an AWS m5zn.metal instance), with the same 
>> settings and considerations as in the measurements above.
>> 
>> benchmark | baseline | fast-locking | % | size
>> -- | -- | -- | -- | --
>> avrora | 127690 | 126749 | 0.74% | large
>> batik | 12736 | 12641 | 0.75% | large
>> biojava | 15423 | 15404 | 0.12% | default
>> eclipse | 41174 | 41498 | -0.78% | large
>> fop | 2184 | 2172 | 0.55% | default
>> graphchi | 1579 | 1560 | 1.22% | default
>> h2 | 227614 | 230040 | -1.05% | huge
>> jme | 8591 | 8398 | 2.30% | default
>> jython | 13473 | 13356 | 0.88% | default
>> luindex | 824 | 813 | 1.35% | default
>> lusearch | 962 | 968 | -0.62% | default
>> pmd | 40827 | 39654 | 2.96% | large
>> sunflow | 53362 | 43475 | 22.74% | large
>> tomcat | 27549 | 28029 | -1.71% | large
>> tradebeans | 190757 | 190994 | -0.12% | huge
>> tradesoap | 68099 | 67934 | 0.24% | huge
>> xalan | 7969 | 8178 | -2.56% | default
>> zxing | 1176 | 1148 | 2.44% | default
>> 
>> #### Renaissance/AArch64
>> 
>> This tests Renaissance/JMH version 0.14.1 on the same machines as DaCapo 
>> above, with the same JVM settings.
>> 
>> benchmark | baseline | fast-locking | %
>> -- | -- | -- | --
>> AkkaUct | 2558.832 | 2513.594 | 1.80%
>> Reactors | 14715.626 | 14311.246 | 2.83%
>> Als | 1851.485 | 1869.622 | -0.97%
>> ChiSquare | 1007.788 | 1003.165 | 0.46%
>> GaussMix | 1157.491 | 1149.969 | 0.65%
>> LogRegression | 717.772 | 733.576 | -2.15%
>> MovieLens | 7916.181 | 8002.226 | -1.08%
>> NaiveBayes | 395.296 | 386.611 | 2.25%
>> PageRank | 4294.939 | 4346.333 | -1.18%
>> FjKmeans | 496.076 | 493.873 | 0.45%
>> FutureGenetic | 2578.504 | 2589.255 | -0.42%
>> Mnemonics | 4898.886 | 4903.689 | -0.10%
>> ParMnemonics | 4260.507 | 4210.121 | 1.20%
>> Scrabble | 139.37 | 138.312 | 0.76%
>> RxScrabble | 320.114 | 322.651 | -0.79%
>> Dotty | 1056.543 | 1068.492 | -1.12%
>> ScalaDoku | 3443.117 | 3449.477 | -0.18%
>> ScalaKmeans | 259.384 | 258.648 | 0.28%
>> Philosophers | 24333.311 | 23438.22 | 3.82%
>> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14%
>> FinagleChirper | 6814.192 | 6853.38 | -0.57%
>> FinagleHttp | 4762.902 | 4807.564 | -0.93%
>> 
>> #### Renaissance/x86_64
>> 
>> benchmark | baseline | fast-locking | %
>> -- | -- | -- | --
>> AkkaUct | 1117.185 | 1116.425 | 0.07%
>> Reactors | 11561.354 | 11812.499 | -2.13%
>> Als | 1580.838 | 1575.318 | 0.35%
>> ChiSquare | 459.601 | 467.109 | -1.61%
>> GaussMix | 705.944 | 685.595 | 2.97%
>> LogRegression | 659.944 | 656.428 | 0.54%
>> MovieLens | 7434.303 | 7592.271 | -2.08%
>> NaiveBayes | 413.482 | 417.369 | -0.93%
>> PageRank | 3259.233 | 3276.589 | -0.53%
>> FjKmeans | 946.429 | 938.991 | 0.79%
>> FutureGenetic | 1760.672 | 1815.272 | -3.01%
>> ParMnemonics | 2016.917 | 2033.101 | -0.80%
>> Scrabble | 147.996 | 150.084 | -1.39%
>> RxScrabble | 177.755 | 177.956 | -0.11%
>> Dotty | 673.754 | 683.919 | -1.49%
>> ScalaDoku | 2193.562 | 1958.419 | 12.01%
>> ScalaKmeans | 165.376 | 168.925 | -2.10%
>> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95%
>> Philosophers | 14268.449 | 13308.87 | 7.21%
>> FinagleChirper | 4722.13 | 4688.3 | 0.72%
>> FinagleHttp | 3497.241 | 3605.118 | -2.99%
>> 
>> Some Renaissance benchmarks are missing: DecTree, DbShootout and 
>> Neo4jAnalytics are not compatible with JDK20. The remaining missing 
>> benchmarks show very high run-to-run variance, which I am investigating (and 
>> probably addressing by running them much more often).
>> 
>> I have also run another benchmark, which is a popular Java JVM benchmark, 
>> with workloads wrapped in JMH and very slightly modified to run with newer 
>> JDKs, but I won't publish the results because I am not sure about the 
>> licensing terms. They look similar to the measurements above (i.e. +/- 2%, 
>> nothing very suspicious).
>> 
>> Please let me know if you want me to run any other workloads, or, even 
>> better, run them yourself and report here.
>> 
>> ### Testing
>>  - [x] tier1 (x86_64, aarch64, x86_32)
>>  - [x] tier2 (x86_64, aarch64)
>>  - [x] tier3 (x86_64, aarch64)
>>  - [x] tier4 (x86_64, aarch64)
>>  - [x] jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64)
>
> Roman Kennke has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 37 commits:
> 
>  - Merge remote-tracking branch 'upstream/master' into fast-locking
>  - Merge remote-tracking branch 'upstream/master' into fast-locking
>  - Merge remote-tracking branch 'upstream/master' into fast-locking
>  - More RISC-V fixes
>  - Merge remote-tracking branch 'origin/fast-locking' into fast-locking
>  - RISC-V port
>  - Revert "Re-use r0 in call to unlock_object()"
>    
>    This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46.
>  - Merge remote-tracking branch 'origin/fast-locking' into fast-locking
>  - Fix number of rt args to complete_monitor_locking_C, remove some comments
>  - Re-use r0 in call to unlock_object()
>  - ... and 27 more: https://git.openjdk.org/jdk/compare/4b89fce0...3f0acba4

FYI: I am working on an alternative PR for this that makes fast-locking 
optional and opt-in behind an experimental switch. It will also be much less 
invasive (no structural changes except where absolutely necessary, no 
cleanups) and thus easier to handle.

-------------

PR: https://git.openjdk.org/jdk/pull/10590
