Re: RFR: 8291555: Replace stack-locking with fast-locking [v7]

Roman Kennke Mon, 24 Oct 2022 01:03:40 -0700

> This change replaces the current stack-locking implementation with a 
> fast-locking scheme that retains the advantages of stack-locking (namely fast 
> locking in uncontended code-paths), while avoiding the overload of the mark 
> word. That overloading causes massive problems with Lilliput, because it 
> means we have to check and deal with this situation. And because of the very 
> racy nature, this turns out to be very complex and involved a variant of the 
> inflation protocol to ensure that the object header is stable. 
> 
> What the original stack-locking does is basically to push a stack-lock onto 
> the stack which consists only of the displaced header, and CAS a pointer to 
> this stack location into the object header (the lowest two header bits being 
> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
> identify which thread currently owns the lock.
> 
> This change basically reverses stack-locking: It still CASes the lowest two 
> header bits to 00 to indicate 'fast-locked' but does *not* overload the upper 
> bits with a stack-pointer. Instead, it pushes the object-reference to a 
> thread-local lock-stack. This is a new structure which is basically a small 
> array of oops that is associated with each thread. Experience shows that this 
> array typcially remains very small (3-5 elements). Using this lock stack, it 
> is possible to query which threads own which locks. Most importantly, the 
> most common question 'does the current thread own me?' is very quickly 
> answered by doing a quick scan of the array. More complex queries like 'which 
> thread owns X?' are not performed in very performance-critical paths (usually 
> in code like JVMTI or deadlock detection) where it is ok to do more complex 
> operations. The lock-stack is also a new set of GC roots, and would be 
> scanned during thread scanning, possibly concurrently, via the normal 
> protocols.
> 
> In contrast to stack-locking, fast-locking does *not* support recursive 
> locking (yet). When that happens, the fast-lock gets inflated to a full 
> monitor. It is not clear if it is worth to add support for recursive 
> fast-locking.
> 
> One trouble is that when a contending thread arrives at a fast-locked object, 
> it must inflate the fast-lock to a full monitor. Normally, we need to know 
> the current owning thread, and record that in the monitor, so that the 
> contending thread can wait for the current owner to properly exit the 
> monitor. However, fast-locking doesn't have this information. What we do 
> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
> currently holds the lock arrives at monitorexit, and observes 
> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
> and then properly exits the monitor, and thus handing over to the contending 
> thread.
> 
> As an alternative, I considered to remove stack-locking altogether, and only 
> use heavy monitors. In most workloads this did not show measurable 
> regressions. However, in a few workloads, I have observed severe regressions. 
> All of them have been using old synchronized Java collections (Vector, 
> Stack), StringBuffer or similar code. The combination of two conditions leads 
> to regressions without stack- or fast-locking: 1. The workload synchronizes 
> on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 
> 2. The workload churns such locks. IOW, uncontended use of Vector, 
> StringBuffer, etc as such is ok, but creating lots of such single-use, 
> single-threaded-locked objects leads to massive ObjectMonitor churn, which 
> can lead to a significant performance impact. But alas, such code exists, and 
> we probably don't want to punish it if we can avoid it.
> 
> This change enables to simplify (and speed-up!) a lot of code:
> 
> - The inflation protocol is no longer necessary: we can directly CAS the 
> (tagged) ObjectMonitor pointer to the object header.
> - Accessing the hashcode could now be done in the fastpath always, if the 
> hashcode has been installed. Fast-locked headers can be used directly, for 
> monitor-locked objects we can easily reach-through to the displaced header. 
> This is safe because Java threads participate in monitor deflation protocol. 
> This would be implemented in a separate PR
> 
> ### Benchmarks
> 
> All benchmarks are run on server-class metal machines. The JVM settings are 
> always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less 
> is better.
> 
> #### DaCapo/AArch64
> 
> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an 
> AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 
> 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to 
> exclude cassandra, h2o & kafka benchmarks because of incompatibility with 
> JDK20. Benchmarks that showed results far off the baseline or showed high 
> variance have been repeated and I am reporting results with the most bias 
> *against* fast-locking. The sunflow benchmark is really far off the mark - 
> the baseline run with stack-locking exhibited very high run-to-run variance 
> and generally much worse performance, while with fast-locking the variance 
> was very low and the results very stable between runs. I wouldn't trust that 
> benchmark - I mean what is it actually doing that a change in locking shows 
> >30% perf difference?
> 
> benchmark | baseline | fast-locking | % | size
> -- | -- | -- | -- | --
> avrora | 27859 | 27563 | 1.07% | large
> batik | 20786 | 20847 | -0.29% | large
> biojava | 27421 | 27334 | 0.32% | default
> eclipse | 59918 | 60522 | -1.00% | large
> fop | 3670 | 3678 | -0.22% | default
> graphchi | 2088 | 2060 | 1.36% | default
> h2 | 297391 | 291292 | 2.09% | huge
> jme | 8762 | 8877 | -1.30% | default
> jython | 18938 | 18878 | 0.32% | default
> luindex | 1339 | 1325 | 1.06% | default
> lusearch | 918 | 936 | -1.92% | default
> pmd | 58291 | 58423 | -0.23% | large
> sunflow | 32617 | 24961 | 30.67% | large
> tomcat | 25481 | 25992 | -1.97% | large
> tradebeans | 314640 | 311706 | 0.94% | huge
> tradesoap | 107473 | 110246 | -2.52% | huge
> xalan | 6047 | 5882 | 2.81% | default
> zxing | 970 | 926 | 4.75% | default
> 
> #### DaCapo/x86_64
> 
> The following measurements have been taken on an Intel Xeon Scalable 
> Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same 
> settings and considerations as in the measurements above.
> 
> benchmark | baseline | fast-Locking | % | size
> -- | -- | -- | -- | --
> avrora | 127690 | 126749 | 0.74% | large
> batik | 12736 | 12641 | 0.75% | large
> biojava | 15423 | 15404 | 0.12% | default
> eclipse | 41174 | 41498 | -0.78% | large
> fop | 2184 | 2172 | 0.55% | default
> graphchi | 1579 | 1560 | 1.22% | default
> h2 | 227614 | 230040 | -1.05% | huge
> jme | 8591 | 8398 | 2.30% | default
> jython | 13473 | 13356 | 0.88% | default
> luindex | 824 | 813 | 1.35% | default
> lusearch | 962 | 968 | -0.62% | default
> pmd | 40827 | 39654 | 2.96% | large
> sunflow | 53362 | 43475 | 22.74% | large
> tomcat | 27549 | 28029 | -1.71% | large
> tradebeans | 190757 | 190994 | -0.12% | huge
> tradesoap | 68099 | 67934 | 0.24% | huge
> xalan | 7969 | 8178 | -2.56% | default
> zxing | 1176 | 1148 | 2.44% | default
> 
> #### Renaissance/AArch64
> 
> This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, 
> with same JVM settings.
> 
> benchmark | baseline | fast-locking | %
> -- | -- | -- | --
> AkkaUct | 2558.832 | 2513.594 | 1.80%
> Reactors | 14715.626 | 14311.246 | 2.83%
> Als | 1851.485 | 1869.622 | -0.97%
> ChiSquare | 1007.788 | 1003.165 | 0.46%
> GaussMix | 1157.491 | 1149.969 | 0.65%
> LogRegression | 717.772 | 733.576 | -2.15%
> MovieLens | 7916.181 | 8002.226 | -1.08%
> NaiveBayes | 395.296 | 386.611 | 2.25%
> PageRank | 4294.939 | 4346.333 | -1.18%
> FjKmeans | 496.076 | 493.873 | 0.45%
> FutureGenetic | 2578.504 | 2589.255 | -0.42%
> Mnemonics | 4898.886 | 4903.689 | -0.10%
> ParMnemonics | 4260.507 | 4210.121 | 1.20%
> Scrabble | 139.37 | 138.312 | 0.76%
> RxScrabble | 320.114 | 322.651 | -0.79%
> Dotty | 1056.543 | 1068.492 | -1.12%
> ScalaDoku | 3443.117 | 3449.477 | -0.18%
> ScalaKmeans | 259.384 | 258.648 | 0.28%
> Philosophers | 24333.311 | 23438.22 | 3.82%
> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14%
> FinagleChirper | 6814.192 | 6853.38 | -0.57%
> FinagleHttp | 4762.902 | 4807.564 | -0.93%
> 
> #### Renaissance/x86_64
> 
> benchmark | baseline | fast-locking | %
> -- | -- | -- | --
> AkkaUct | 1117.185 | 1116.425 | 0.07%
> Reactors | 11561.354 | 11812.499 | -2.13%
> Als | 1580.838 | 1575.318 | 0.35%
> ChiSquare | 459.601 | 467.109 | -1.61%
> GaussMix | 705.944 | 685.595 | 2.97%
> LogRegression | 659.944 | 656.428 | 0.54%
> MovieLens | 7434.303 | 7592.271 | -2.08%
> NaiveBayes | 413.482 | 417.369 | -0.93%
> PageRank | 3259.233 | 3276.589 | -0.53%
> FjKmeans | 946.429 | 938.991 | 0.79%
> FutureGenetic | 1760.672 | 1815.272 | -3.01%
> ParMnemonics | 2016.917 | 2033.101 | -0.80%
> Scrabble | 147.996 | 150.084 | -1.39%
> RxScrabble | 177.755 | 177.956 | -0.11%
> Dotty | 673.754 | 683.919 | -1.49%
> ScalaDoku | 2193.562 | 1958.419 | 12.01%
> ScalaKmeans | 165.376 | 168.925 | -2.10%
> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95%
> Philosophers | 14268.449 | 13308.87 | 7.21%
> FinagleChirper | 4722.13 | 4688.3 | 0.72%
> FinagleHttp | 3497.241 | 3605.118 | -2.99%
> 
> Some renaissance benchmarks are missing: DecTree, DbShootout and 
> Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show 
> very high run-to-run variance, which I am investigating (and probably 
> addressing with running them much more often.
> 
> I have also run another benchmark, which is a popular Java JVM benchmark, 
> with workloads wrapped in JMH and very slightly modified to run with newer 
> JDKs, but I won't publish the results because I am not sure about the 
> licensing terms. They look similar to the measurements above (i.e. +/- 2%, 
> nothing very suspicious).
> 
> Please let me know if you want me to run any other workloads, or, even 
> better, run them yourself and report here.
> 
> ### Testing
>  - [x] tier1 (x86_64, aarch64, x86_32)
>  - [x] tier2 (x86_64, aarch64)
>  - [x] tier3 (x86_64, aarch64)
>  - [x] tier4 (x86_64, aarch64)
>  - [x] jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64)


Roman Kennke has updated the pull request with a new target base due to a merge 
or a rebase. The pull request now contains 35 commits:

 - Merge remote-tracking branch 'upstream/master' into fast-locking
 - More RISC-V fixes
 - Merge remote-tracking branch 'origin/fast-locking' into fast-locking
 - RISC-V port
 - Revert "Re-use r0 in call to unlock_object()"
   
   This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46.
 - Merge remote-tracking branch 'origin/fast-locking' into fast-locking
 - Fix number of rt args to complete_monitor_locking_C, remove some comments
 - Re-use r0 in call to unlock_object()
 - Merge tag 'jdk-20+17' into fast-locking
   
   Added tag jdk-20+17 for changeset 79ccc791
 - Fix OSR packing in AArch64, part 2
 - ... and 25 more: https://git.openjdk.org/jdk/compare/65c84e0c...a67eb95e

-------------

Changes: https://git.openjdk.org/jdk/pull/10590/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=06
  Stats: 4031 lines in 137 files changed: 731 ins; 2703 del; 597 mod
  Patch: https://git.openjdk.org/jdk/pull/10590.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590

PR: https://git.openjdk.org/jdk/pull/10590

Re: RFR: 8291555: Replace stack-locking with fast-locking [v7]

Reply via email to