This change replaces the current stack-locking implementation with a 
fast-locking scheme that retains the advantages of stack-locking (namely fast 
locking in uncontended code-paths), while avoiding the overloading of the mark 
word. That overloading causes massive problems with Lilliput, because it means 
we have to check for and deal with this situation whenever we access the 
header. Because of the very racy nature of stack-locking, this turns out to be 
very complex and involves a variant of the inflation protocol to ensure that 
the object header is stable.

What the original stack-locking does is basically to push a stack-lock onto the 
stack, consisting only of the displaced header, and CAS a pointer to this stack 
location into the object header (the lowest two header bits being 00 indicates 
'stack-locked'). The pointer into the stack can then be used to identify which 
thread currently owns the lock.
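
To illustrate, here is a minimal sketch of that scheme, using `std::atomic` as 
a stand-in for the mark word and its CAS; the names (`Object`, `BasicLock`) and 
bit encodings are illustrative, not HotSpot's actual code:

```c++
#include <atomic>
#include <cstdint>

struct Object {
  std::atomic<uintptr_t> header;  // mark word; lowest two bits encode the lock state
};

constexpr uintptr_t UNLOCKED  = 0b01;  // low bits 01: unlocked
constexpr uintptr_t LOCK_MASK = 0b11;  // low bits 00: stack-locked

// The stack-lock lives in the owning thread's frame and holds only the
// displaced header.
struct BasicLock {
  uintptr_t displaced_header;
};

// Save the current (unlocked) header into the on-stack BasicLock, then CAS
// a pointer to that stack slot into the object header. Stack slots are
// word-aligned, so the low two bits of the pointer are 00 ('stack-locked'),
// and the pointer itself identifies the owning thread.
bool try_stack_lock(Object* obj, BasicLock* lock) {
  uintptr_t mark = obj->header.load(std::memory_order_relaxed);
  if ((mark & LOCK_MASK) != UNLOCKED) return false;  // take the slow path
  lock->displaced_header = mark;
  return obj->header.compare_exchange_strong(
      mark, reinterpret_cast<uintptr_t>(lock));
}
```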

This change basically reverses stack-locking: it still CASes the lowest two 
header bits to 00 to indicate 'fast-locked', but does *not* overload the upper 
bits with a stack pointer. Instead, it pushes the object reference onto a 
thread-local lock-stack. This is a new structure which is basically a small 
array of oops associated with each thread. Experience shows that this array 
typically remains very small (3-5 elements). Using this lock-stack, it is 
possible to query which threads own which locks. Most importantly, the most 
common question 'does the current thread own me?' is answered very quickly by 
a scan of the array. More complex queries like 'which thread owns X?' are only 
performed in paths that are not performance-critical (usually code like JVMTI 
or deadlock detection), where it is OK to do more complex operations. The 
lock-stack is also a new set of GC roots, and would be scanned during thread 
scanning, possibly concurrently, via the normal protocols.
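
As a rough sketch (restating the stand-in definitions from the previous sketch; 
names, encodings and the fixed capacity are assumptions, not the actual HotSpot 
data structures), the lock-stack and the fast-lock path could look like this:

```c++
#include <atomic>
#include <cstdint>

struct Object { std::atomic<uintptr_t> header; };

constexpr uintptr_t UNLOCKED    = 0b01;  // low bits 01: unlocked
constexpr uintptr_t FAST_LOCKED = 0b00;  // low bits 00: fast-locked
constexpr uintptr_t LOCK_MASK   = 0b11;

// Small per-thread array of owned oops; it typically stays at 3-5 entries.
class LockStack {
  static constexpr int CAPACITY = 8;  // assumed fixed size for the sketch
  Object* _elems[CAPACITY];
  int _top = 0;
public:
  bool push(Object* o) {
    if (_top == CAPACITY) return false;  // would force inflation instead
    _elems[_top++] = o;
    return true;
  }
  void remove(Object* o) {
    for (int i = 0; i < _top; i++) {
      if (_elems[i] == o) { _elems[i] = _elems[--_top]; return; }
    }
  }
  // The common question 'does the current thread own o?' is a quick scan.
  bool contains(Object* o) const {
    for (int i = 0; i < _top; i++) {
      if (_elems[i] == o) return true;
    }
    return false;
  }
};

thread_local LockStack lock_stack;  // one lock-stack per thread

// Fast-lock: CAS the low header bits 01 -> 00, leaving the upper bits
// untouched, and record ownership on the thread-local lock-stack.
bool try_fast_lock(Object* obj) {
  uintptr_t mark = obj->header.load(std::memory_order_relaxed);
  if ((mark & LOCK_MASK) != UNLOCKED) return false;  // take the slow path
  uintptr_t locked = (mark & ~LOCK_MASK) | FAST_LOCKED;
  if (obj->header.compare_exchange_strong(mark, locked)) {
    lock_stack.push(obj);  // a full lock-stack would force inflation instead
    return true;
  }
  return false;
}
```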

In contrast to stack-locking, fast-locking does *not* support recursive locking 
(yet). When a recursive lock is attempted, the fast-lock gets inflated to a 
full monitor. It is not clear whether it is worth adding support for recursive 
fast-locking.
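
Building on the lock-stack sketch above, the recursion check might look like 
the following; `inflate_and_lock()` is a hypothetical stand-in for the monitor 
slow path, stubbed out here:

```c++
// Hypothetical slow path: install and enter a full ObjectMonitor (stub).
bool inflate_and_lock(Object* obj) { (void)obj; return true; /* stubbed */ }

bool lock(Object* obj) {
  if (lock_stack.contains(obj)) {
    // Recursive acquisition: fast-locking can't express it (yet), so the
    // fast-lock is inflated to a full monitor instead.
    return inflate_and_lock(obj);
  }
  if (try_fast_lock(obj)) return true;
  return inflate_and_lock(obj);  // contended, or already monitor-locked
}
```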

One complication is that when a contending thread arrives at a fast-locked 
object, it must inflate the fast-lock to a full monitor. Normally, we need to 
know the current owning thread and record it in the monitor, so that the 
contending thread can wait for the current owner to properly exit the monitor. 
However, fast-locking doesn't have this information. What we do instead is 
record a special marker, ANONYMOUS_OWNER. When the thread that currently holds 
the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows it must 
be itself, fixes the owner to be itself, and then properly exits the monitor, 
thus handing the lock over to the contending thread.
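
Here is a self-contained sketch of this hand-off; the monitor layout, the 
marker value and the helper names are assumptions for illustration, and only 
the owner-field handling is shown:

```c++
#include <atomic>

// Stand-in: each thread is identified by the address of a thread_local tag.
thread_local char thread_tag;
static void* current_thread() { return &thread_tag; }

// Assumed marker: a sentinel owner value that no real thread can have.
static void* const ANONYMOUS_OWNER = reinterpret_cast<void*>(1);

struct ObjectMonitor {
  std::atomic<void*> owner{nullptr};
};

// Contending thread: inflate a fast-locked object without knowing who owns
// it, recording the ANONYMOUS_OWNER marker in the monitor.
void inflate_contended(ObjectMonitor* mon) {
  mon->owner.store(ANONYMOUS_OWNER, std::memory_order_release);
}

// Owning thread at monitorexit: if it observes ANONYMOUS_OWNER, the
// anonymous owner can only be itself; it fixes up the owner field and then
// exits normally, handing the lock over to the contending thread.
void monitor_exit(ObjectMonitor* mon) {
  if (mon->owner.load(std::memory_order_acquire) == ANONYMOUS_OWNER) {
    mon->owner.store(current_thread(), std::memory_order_relaxed);
  }
  // ... regular exit: clear owner, unpark a waiting successor, etc.
}
```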

As an alternative, I considered removing stack-locking altogether and only 
using heavy monitors. In most workloads this did not show measurable 
regressions. However, in a few workloads I observed severe regressions. All of 
them were using old synchronized Java collections (Vector, Stack), StringBuffer 
or similar code. The combination of two conditions leads to regressions without 
stack- or fast-locking: 1. the workload synchronizes on uncontended locks (e.g. 
single-threaded use of Vector or StringBuffer), and 2. the workload churns such 
locks. In other words, uncontended use of Vector, StringBuffer, etc. as such is 
OK, but creating lots of such single-use, single-threaded-locked objects leads 
to massive ObjectMonitor churn, which can lead to a significant performance 
impact. But alas, such code exists, and we probably don't want to punish it if 
we can avoid it.

This change enables us to simplify (and speed up!) a lot of code:

- The inflation protocol is no longer necessary: we can directly CAS the 
(tagged) ObjectMonitor pointer into the object header.
- Accessing the hash code can now always be done in the fast path, once the 
hash code has been installed. Fast-locked headers can be used directly; for 
monitor-locked objects we can easily reach through to the displaced header. 
This is safe because Java threads participate in the monitor deflation 
protocol. This would be implemented in a separate PR. (Both points are 
sketched below.)
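
Both simplifications could look roughly like the following sketch; the tag 
values, field names and function names here are assumptions for illustration, 
not HotSpot's actual encoding:

```c++
#include <atomic>
#include <cstdint>

struct ObjectMonitor { uintptr_t displaced_header; };
struct Object { std::atomic<uintptr_t> header; };

constexpr uintptr_t MONITOR_TAG = 0b10;  // assumed: low bits 10 = monitor installed
constexpr uintptr_t LOCK_MASK   = 0b11;

// With the upper header bits free, no multi-step inflation protocol is
// needed: a single CAS installs the tagged monitor pointer.
bool install_monitor(Object* obj, ObjectMonitor* mon, uintptr_t observed_mark) {
  mon->displaced_header = observed_mark;  // preserves e.g. installed hash bits
  uintptr_t tagged = reinterpret_cast<uintptr_t>(mon) | MONITOR_TAG;
  return obj->header.compare_exchange_strong(observed_mark, tagged);
}

// Hash-code fast path: fast-locked headers can be used directly; for
// monitor-locked objects, reach through to the displaced header. This is
// safe because Java threads participate in the monitor deflation protocol.
uintptr_t hash_bits(Object* obj) {
  uintptr_t mark = obj->header.load(std::memory_order_acquire);
  if ((mark & LOCK_MASK) == MONITOR_TAG) {
    auto* mon = reinterpret_cast<ObjectMonitor*>(mark & ~LOCK_MASK);
    mark = mon->displaced_header;
  }
  return mark;  // caller extracts the hash field from these bits
}
```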

### Benchmarks

All benchmarks are run on server-class metal machines. The JVM settings are 
always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmark results are in 
ms/op; less is better.

#### DaCapo/AArch64

Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS 
m6g.metal instance), using the DaCapo evaluation version, git hash 309e1fa 
(download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude the 
cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. 
Benchmarks that showed results far off the baseline or high run-to-run variance 
have been repeated, and I am reporting the results with the most bias *against* 
fast-locking. The sunflow benchmark is really far off the mark - the baseline 
run with stack-locking exhibited very high run-to-run variance and generally 
much worse performance, while with fast-locking the variance was very low and 
the results very stable between runs. I wouldn't trust that benchmark - what is 
it actually doing that a change in locking makes a >30% performance difference?

benchmark | baseline | fast-locking | % | size
-- | -- | -- | -- | --
avrora | 27859 | 27563 | 1.07% | large
batik | 20786 | 20847 | -0.29% | large
biojava | 27421 | 27334 | 0.32% | default
eclipse | 59918 | 60522 | -1.00% | large
fop | 3670 | 3678 | -0.22% | default
graphchi | 2088 | 2060 | 1.36% | default
h2 | 297391 | 291292 | 2.09% | huge
jme | 8762 | 8877 | -1.30% | default
jython | 18938 | 18878 | 0.32% | default
luindex | 1339 | 1325 | 1.06% | default
lusearch | 918 | 936 | -1.92% | default
pmd | 58291 | 58423 | -0.23% | large
sunflow | 32617 | 24961 | 30.67% | large
tomcat | 25481 | 25992 | -1.97% | large
tradebeans | 314640 | 311706 | 0.94% | huge
tradesoap | 107473 | 110246 | -2.52% | huge
xalan | 6047 | 5882 | 2.81% | default
zxing | 970 | 926 | 4.75% | default

#### DaCapo/x86_64

The following measurements have been taken on an Intel Xeon Scalable Processor 
(Cascade Lake 8252C) (an AWS m5zn.metal instance), with all the same settings 
and considerations as in the measurements above.

benchmark | baseline | fast-locking | % | size
-- | -- | -- | -- | --
avrora | 127690 | 126749 | 0.74% | large
batik | 12736 | 12641 | 0.75% | large
biojava | 15423 | 15404 | 0.12% | default
eclipse | 41174 | 41498 | -0.78% | large
fop | 2184 | 2172 | 0.55% | default
graphchi | 1579 | 1560 | 1.22% | default
h2 | 227614 | 230040 | -1.05% | huge
jme | 8591 | 8398 | 2.30% | default
jython | 13473 | 13356 | 0.88% | default
luindex | 824 | 813 | 1.35% | default
lusearch | 962 | 968 | -0.62% | default
pmd | 40827 | 39654 | 2.96% | large
sunflow | 53362 | 43475 | 22.74% | large
tomcat | 27549 | 28029 | -1.71% | large
tradebeans | 190757 | 190994 | -0.12% | huge
tradesoap | 68099 | 67934 | 0.24% | huge
xalan | 7969 | 8178 | -2.56% | default
zxing | 1176 | 1148 | 2.44% | default

#### Renaissance/AArch64

This tests Renaissance/JMH version 0.14.1 on the same machines as DaCapo above, 
with the same JVM settings.

benchmark | baseline | fast-locking | %
-- | -- | -- | --
AkkaUct | 2558.832 | 2513.594 | 1.80%
Reactors | 14715.626 | 14311.246 | 2.83%
Als | 1851.485 | 1869.622 | -0.97%
ChiSquare | 1007.788 | 1003.165 | 0.46%
GaussMix | 1157.491 | 1149.969 | 0.65%
LogRegression | 717.772 | 733.576 | -2.15%
MovieLens | 7916.181 | 8002.226 | -1.08%
NaiveBayes | 395.296 | 386.611 | 2.25%
PageRank | 4294.939 | 4346.333 | -1.18%
FjKmeans | 519.2 | 498.357 | 4.18%
FutureGenetic | 2578.504 | 2589.255 | -0.42%
Mnemonics | 4898.886 | 4903.689 | -0.10%
ParMnemonics | 4260.507 | 4210.121 | 1.20%
Scrabble | 139.37 | 138.312 | 0.76%
RxScrabble | 320.114 | 322.651 | -0.79%
Dotty | 1056.543 | 1068.492 | -1.12%
ScalaDoku | 3443.117 | 3449.477 | -0.18%
ScalaStmBench7 | 1102.43 | 1115.142 | -1.14%
FinagleChirper | 6814.192 | 6853.38 | -0.57%
FinagleHttp | 4762.902 | 4807.564 | -0.93%

#### Renaissance/x86_64

benchmark | baseline | fast-locking | %
-- | -- | -- | --
AkkaUct | 1117.185 | 1116.425 | 0.07%
Reactors | 11561.354 | 11812.499 | -2.13%
Als | 1580.838 | 1575.318 | 0.35%
ChiSquare | 459.601 | 467.109 | -1.61%
GaussMix | 705.944 | 685.595 | 2.97%
LogRegression | 659.944 | 656.428 | 0.54%
MovieLens | 7434.303 | 7592.271 | -2.08%
NaiveBayes | 413.482 | 417.369 | -0.93%
PageRank | 3259.233 | 3276.589 | -0.53%
FjKmeans | 946.429 | 938.991 | 0.79%
FutureGenetic | 1760.672 | 1815.272 | -3.01%
Scrabble | 147.996 | 150.084 | -1.39%
RxScrabble | 177.755 | 177.956 | -0.11%
Dotty | 673.754 | 683.919 | -1.49%
ScalaKmeans | 165.376 | 168.925 | -2.10%
ScalaStmBench7 | 1080.187 | 1049.184 | 2.95%

Some Renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics 
are not compatible with JDK20. The remaining missing benchmarks show very high 
run-to-run variance, which I am investigating (and will probably address by 
running them much more often).

Please let me know if you want me to run any other workloads, or, even better, 
run them yourself and report here.

### Testing
 - [x] tier1 (x86_64, aarch64, x86_32)
 - [x] tier2 (x86_64, aarch64)
 - [x] tier3 (x86_64, aarch64)
 - [x] tier4 (x86_64, aarch64)

-------------

Commit messages:
 - Merge tag 'jdk-20+17' into fast-locking
 - Fix OSR packing in AArch64, part 2
 - Fix OSR packing in AArch64
 - Merge remote-tracking branch 'upstream/master' into fast-locking
 - Fix register in interpreter unlock x86_32
 - Support unstructured locking in interpreter (x86 parts)
 - Support unstructured locking in interpreter (aarch64 and shared parts)
 - Merge branch 'master' into fast-locking
 - Merge branch 'master' into fast-locking
 - Added test for hand-over-hand locking
 - ... and 17 more: https://git.openjdk.org/jdk/compare/79ccc791...3ed51053

Changes: https://git.openjdk.org/jdk/pull/9680/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9680&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8291555
  Stats: 3660 lines in 127 files changed: 650 ins; 2481 del; 529 mod
  Patch: https://git.openjdk.org/jdk/pull/9680.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9680/head:pull/9680

PR: https://git.openjdk.org/jdk/pull/9680
