On Mon, 24 Oct 2022 08:03:13 GMT, Roman Kennke <rken...@openjdk.org> wrote:

>> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overloading of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check for and deal with this situation. And because of the very racy nature, this turns out to be very complex and involves a variant of the inflation protocol to ensure that the object header is stable.
>>
>> What the original stack-locking does is basically to push a stack-lock onto the stack, which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicates 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
>>
>> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked', but does *not* overload the upper bits with a stack pointer. Instead, it pushes the object reference onto a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is answered very quickly by a scan of the array. More complex queries like 'which thread owns X?' are only performed in paths that are not performance-critical (usually code like JVMTI or deadlock detection), where it is OK to do more expensive operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. A simplified sketch of this fast path is shown below.
>>
>> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When recursive locking is encountered, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking.
>>
>> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker, ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows the owner must be itself, fixes the owner field to point to itself, and then properly exits the monitor, thereby handing over to the contending thread.
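>>
>> To make this more concrete, here is a heavily simplified sketch of the lock stack and the uncontended fast path described above. This is not the actual HotSpot code; the types, names (`oopDesc`, `LockStack`, `fast_lock`) and constants are made up for illustration, and every slow path simply reports "fall back to inflation":
>>
>> ```c++
>> #include <atomic>
>> #include <cstddef>
>> #include <cstdint>
>>
>> // Illustrative stand-in for an object header; the two lowest mark bits
>> // encode the lock state (01 = unlocked, 00 = fast-locked).
>> struct oopDesc {
>>   std::atomic<uintptr_t> mark;
>> };
>> typedef oopDesc* oop;
>>
>> constexpr uintptr_t lock_mask   = 0b11;
>> constexpr uintptr_t unlocked    = 0b01;
>> constexpr uintptr_t fast_locked = 0b00;
>>
>> // Small per-thread array of fast-locked oops. It typically stays tiny,
>> // so scanning it to answer "does the current thread own this lock?" is cheap.
>> class LockStack {
>>   static constexpr size_t CAPACITY = 8;
>>   oop _elems[CAPACITY];
>>   size_t _top = 0;
>> public:
>>   bool is_full() const { return _top == CAPACITY; }
>>   void push(oop o)     { _elems[_top++] = o; }
>>   void pop(oop o)      { if (_top > 0 && _elems[_top - 1] == o) _top--; }
>>   bool contains(oop o) const {
>>     for (size_t i = 0; i < _top; i++) {
>>       if (_elems[i] == o) return true;
>>     }
>>     return false;
>>   }
>> };
>>
>> // Uncontended fast path: flip the lock bits from 01 to 00 with a CAS,
>> // leaving the rest of the header untouched, and remember the oop on the
>> // current thread's lock stack. Any failure means: inflate a full monitor.
>> bool fast_lock(oop obj, LockStack& lock_stack) {
>>   if (lock_stack.is_full()) return false;            // would inflate instead
>>   uintptr_t mark = obj->mark.load(std::memory_order_relaxed);
>>   if ((mark & lock_mask) != unlocked) return false;  // already locked
>>   uintptr_t locked = (mark & ~lock_mask) | fast_locked;
>>   if (!obj->mark.compare_exchange_strong(mark, locked,
>>                                          std::memory_order_acquire)) {
>>     return false;                                    // lost the race
>>   }
>>   lock_stack.push(obj);
>>   return true;
>> }
>> ```
>>
>> The important property is that the object header never holds a pointer into the stack; the ownership information lives entirely in the per-thread array.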
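>>
>> And a similarly simplified sketch of the ANONYMOUS_OWNER hand-over described above (again only illustrative, not the real ObjectMonitor code; the owner field is modelled as a plain thread id):
>>
>> ```c++
>> #include <atomic>
>> #include <cstdint>
>>
>> // ANONYMOUS_OWNER marks a monitor that was inflated from a fast-lock by a
>> // contending thread, which cannot tell who the actual owner is.
>> struct MonitorSketch {
>>   static constexpr uintptr_t ANONYMOUS_OWNER = 1;
>>   std::atomic<uintptr_t> owner{0};   // 0 = unowned, otherwise a thread id
>> };
>>
>> // Contending thread: the object is fast-locked by *someone*, so inflate
>> // and record the anonymous marker instead of a concrete owner.
>> void inflate_from_fast_lock(MonitorSketch& m) {
>>   m.owner.store(MonitorSketch::ANONYMOUS_OWNER, std::memory_order_release);
>> }
>>
>> // Owning thread at monitorexit: if it observes ANONYMOUS_OWNER, the
>> // anonymous owner can only be itself, so it fixes up the owner field and
>> // then performs a normal monitor exit, handing over to the contender.
>> void monitor_exit(MonitorSketch& m, uintptr_t self_id) {
>>   if (m.owner.load(std::memory_order_relaxed) == MonitorSketch::ANONYMOUS_OWNER) {
>>     m.owner.store(self_id, std::memory_order_relaxed);  // must be us
>>   }
>>   // ... normal exit protocol: release the lock and wake a successor ...
>>   m.owner.store(0, std::memory_order_release);
>> }
>> ```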
>>
>> As an alternative, I considered removing stack-locking altogether and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc. as such is OK, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.
>>
>> This change makes it possible to simplify (and speed up!) a lot of code:
>>
>> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer into the object header.
>> - Accessing the hashcode could now always be done in the fast path, once the hashcode has been installed. Fast-locked headers can be used directly; for monitor-locked objects we can easily reach through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol. This would be implemented in a separate PR.
>>
>> ### Benchmarks
>>
>> All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All results are in ms/op, less is better.
>>
>> #### DaCapo/AArch64
>>
>> These measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It uses the DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude the cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated, and I am reporting the results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean, what is it actually doing that a change in locking shows a >30% perf difference?
>>
>> benchmark | baseline | fast-locking | % | size
>> -- | -- | -- | -- | --
>> avrora | 27859 | 27563 | 1.07% | large
>> batik | 20786 | 20847 | -0.29% | large
>> biojava | 27421 | 27334 | 0.32% | default
>> eclipse | 59918 | 60522 | -1.00% | large
>> fop | 3670 | 3678 | -0.22% | default
>> graphchi | 2088 | 2060 | 1.36% | default
>> h2 | 297391 | 291292 | 2.09% | huge
>> jme | 8762 | 8877 | -1.30% | default
>> jython | 18938 | 18878 | 0.32% | default
>> luindex | 1339 | 1325 | 1.06% | default
>> lusearch | 918 | 936 | -1.92% | default
>> pmd | 58291 | 58423 | -0.23% | large
>> sunflow | 32617 | 24961 | 30.67% | large
>> tomcat | 25481 | 25992 | -1.97% | large
>> tradebeans | 314640 | 311706 | 0.94% | huge
>> tradesoap | 107473 | 110246 | -2.52% | huge
>> xalan | 6047 | 5882 | 2.81% | default
>> zxing | 970 | 926 | 4.75% | default
>>
>> #### DaCapo/x86_64
>>
>> The following measurements have been taken on an Intel Xeon Scalable Processor (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above.
>>
>> benchmark | baseline | fast-locking | % | size
>> -- | -- | -- | -- | --
>> avrora | 127690 | 126749 | 0.74% | large
>> batik | 12736 | 12641 | 0.75% | large
>> biojava | 15423 | 15404 | 0.12% | default
>> eclipse | 41174 | 41498 | -0.78% | large
>> fop | 2184 | 2172 | 0.55% | default
>> graphchi | 1579 | 1560 | 1.22% | default
>> h2 | 227614 | 230040 | -1.05% | huge
>> jme | 8591 | 8398 | 2.30% | default
>> jython | 13473 | 13356 | 0.88% | default
>> luindex | 824 | 813 | 1.35% | default
>> lusearch | 962 | 968 | -0.62% | default
>> pmd | 40827 | 39654 | 2.96% | large
>> sunflow | 53362 | 43475 | 22.74% | large
>> tomcat | 27549 | 28029 | -1.71% | large
>> tradebeans | 190757 | 190994 | -0.12% | huge
>> tradesoap | 68099 | 67934 | 0.24% | huge
>> xalan | 7969 | 8178 | -2.56% | default
>> zxing | 1176 | 1148 | 2.44% | default
>>
>> #### Renaissance/AArch64
>>
>> This tests Renaissance/JMH version 0.14.1 on the same machines as DaCapo above, with the same JVM settings.
>>
>> benchmark | baseline | fast-locking | %
>> -- | -- | -- | --
>> AkkaUct | 2558.832 | 2513.594 | 1.80%
>> Reactors | 14715.626 | 14311.246 | 2.83%
>> Als | 1851.485 | 1869.622 | -0.97%
>> ChiSquare | 1007.788 | 1003.165 | 0.46%
>> GaussMix | 1157.491 | 1149.969 | 0.65%
>> LogRegression | 717.772 | 733.576 | -2.15%
>> MovieLens | 7916.181 | 8002.226 | -1.08%
>> NaiveBayes | 395.296 | 386.611 | 2.25%
>> PageRank | 4294.939 | 4346.333 | -1.18%
>> FjKmeans | 496.076 | 493.873 | 0.45%
>> FutureGenetic | 2578.504 | 2589.255 | -0.42%
>> Mnemonics | 4898.886 | 4903.689 | -0.10%
>> ParMnemonics | 4260.507 | 4210.121 | 1.20%
>> Scrabble | 139.37 | 138.312 | 0.76%
>> RxScrabble | 320.114 | 322.651 | -0.79%
>> Dotty | 1056.543 | 1068.492 | -1.12%
>> ScalaDoku | 3443.117 | 3449.477 | -0.18%
>> ScalaKmeans | 259.384 | 258.648 | 0.28%
>> Philosophers | 24333.311 | 23438.22 | 3.82%
>> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14%
>> FinagleChirper | 6814.192 | 6853.38 | -0.57%
>> FinagleHttp | 4762.902 | 4807.564 | -0.93%
>>
>> #### Renaissance/x86_64
>>
>> benchmark | baseline | fast-locking | %
>> -- | -- | -- | --
>> AkkaUct | 1117.185 | 1116.425 | 0.07%
>> Reactors | 11561.354 | 11812.499 | -2.13%
>> Als | 1580.838 | 1575.318 | 0.35%
>> ChiSquare | 459.601 | 467.109 | -1.61%
>> GaussMix | 705.944 | 685.595 | 2.97%
>> LogRegression | 659.944 | 656.428 | 0.54%
>> MovieLens | 7434.303 | 7592.271 | -2.08%
>> NaiveBayes | 413.482 | 417.369 | -0.93%
>> PageRank | 3259.233 | 3276.589 | -0.53%
>> FjKmeans | 946.429 | 938.991 | 0.79%
>> FutureGenetic | 1760.672 | 1815.272 | -3.01%
>> ParMnemonics | 2016.917 | 2033.101 | -0.80%
>> Scrabble | 147.996 | 150.084 | -1.39%
>> RxScrabble | 177.755 | 177.956 | -0.11%
>> Dotty | 673.754 | 683.919 | -1.49%
>> ScalaDoku | 2193.562 | 1958.419 | 12.01%
>> ScalaKmeans | 165.376 | 168.925 | -2.10%
>> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95%
>> Philosophers | 14268.449 | 13308.87 | 7.21%
>> FinagleChirper | 4722.13 | 4688.3 | 0.72%
>> FinagleHttp | 3497.241 | 3605.118 | -2.99%
>>
>> Some Renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing by running them much more often).
>>
>> I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms.
>> They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious).
>>
>> Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here.
>>
>> ### Testing
>>
>> - [x] tier1 (x86_64, aarch64, x86_32)
>> - [x] tier2 (x86_64, aarch64)
>> - [x] tier3 (x86_64, aarch64)
>> - [x] tier4 (x86_64, aarch64)
>> - [x] jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64)

> Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 35 commits:
>
>  - Merge remote-tracking branch 'upstream/master' into fast-locking
>  - More RISC-V fixes
>  - Merge remote-tracking branch 'origin/fast-locking' into fast-locking
>  - RISC-V port
>  - Revert "Re-use r0 in call to unlock_object()"
>
>    This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46.
>  - Merge remote-tracking branch 'origin/fast-locking' into fast-locking
>  - Fix number of rt args to complete_monitor_locking_C, remove some comments
>  - Re-use r0 in call to unlock_object()
>  - Merge tag 'jdk-20+17' into fast-locking
>
>    Added tag jdk-20+17 for changeset 79ccc791
>  - Fix OSR packing in AArch64, part 2
>  - ... and 25 more: https://git.openjdk.org/jdk/compare/65c84e0c...a67eb95e

This PR has been in "merge-conflict" state for about 10 days. When do you plan to merge again with the jdk/jdk repo?

-------------

PR: https://git.openjdk.org/jdk/pull/10590