On Fri, 28 Oct 2022 09:32:58 GMT, Roman Kennke <rken...@openjdk.org> wrote:

>> This change replaces the current stack-locking implementation with a 
>> fast-locking scheme that retains the advantages of stack-locking (namely 
>> fast locking in uncontended code-paths), while avoiding the overloading of 
>> the mark word. That overloading causes massive problems with Lilliput, 
>> because it means the mark word may point into a thread's stack, and every 
>> header access has to check for and deal with that situation. Because of the 
>> very racy nature of such accesses, this turns out to be very complex and 
>> involves a variant of the inflation protocol to ensure that the object 
>> header is stable. 
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
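
The legacy scheme described above can be sketched roughly as follows (an illustrative simulation, not HotSpot code; the mark word is modeled as an `AtomicLong`, and `basicLockAddr` stands in for the address of the on-stack slot holding the displaced header):

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of legacy stack-locking (not HotSpot code).
// Low two mark-word bits: 01 = unlocked, 00 = stack-locked; in the
// stack-locked state the remaining bits hold a pointer into the
// owning thread's stack, which identifies the owner.
class StackLockSketch {
    static final long UNLOCKED_BITS = 0b01;

    final AtomicLong markWord = new AtomicLong(UNLOCKED_BITS);

    // basicLockAddr: address of the on-stack slot holding the displaced
    // header; word-aligned, so its low two bits are already 00.
    boolean tryStackLock(long basicLockAddr) {
        long observed = markWord.get();
        if ((observed & 0b11) != UNLOCKED_BITS) {
            return false; // already locked; caller would contend/inflate
        }
        // CAS the stack address into the header; the 00 low bits now
        // indicate 'stack-locked'.
        return markWord.compareAndSet(observed, basicLockAddr);
    }
}
```
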
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typically remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations. The lock-stack is 
>> also a new set of GC roots, and would be scanned during thread scanning, 
>> possibly concurrently, via the normal protocols.
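
A rough model of the lock-stack described above (illustrative only; names and sizes are assumptions, not HotSpot's actual data structure):

```java
import java.util.Arrays;

// Illustrative per-thread lock-stack (not HotSpot code): a small,
// growable array of object references owned by one thread.
final class LockStackSketch {
    private Object[] elems = new Object[8]; // typically stays at 3-5 entries
    private int top = 0;

    void push(Object o) {                   // on successful fast-lock
        if (top == elems.length) {
            elems = Arrays.copyOf(elems, top * 2);
        }
        elems[top++] = o;
    }

    void pop(Object o) {                    // on fast-unlock
        assert top > 0 && elems[top - 1] == o : "unbalanced locking";
        elems[--top] = null;                // clear the GC root
    }

    // The hot query 'does the current thread own this lock?' is a
    // quick linear scan over a handful of elements.
    boolean contains(Object o) {
        for (int i = 0; i < top; i++) {
            if (elems[i] == o) {
                return true;
            }
        }
        return false;
    }
}
```
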
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When a thread attempts to fast-lock an object that it has 
>> already fast-locked, the fast-lock gets inflated to a full monitor. It is 
>> not clear whether it is worth adding support for recursive fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, thus handing the lock over to the 
>> contending thread.
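
The handover described above might be sketched like this (illustrative only; the field names and the sentinel representation are assumptions, not HotSpot's):

```java
// Illustrative sketch of the ANONYMOUS_OWNER handover (not HotSpot code).
class MonitorSketch {
    // Sentinel meaning 'fast-locked by some unknown thread'.
    static final Thread ANONYMOUS_OWNER = new Thread("anonymous-owner");

    volatile Thread owner;

    // Contending thread: inflate a fast-locked object without knowing
    // which thread holds the fast-lock.
    void inflateFromFastLock() {
        owner = ANONYMOUS_OWNER;
    }

    // Owning thread, at monitorexit: seeing the sentinel proves the
    // fast-lock holder is this very thread, so it claims ownership
    // and then exits properly, handing over to the contender.
    void exit() {
        if (owner == ANONYMOUS_OWNER) {
            owner = Thread.currentThread();
        }
        // ... remove object from this thread's lock-stack, wake contenders ...
        owner = null;
    }
}
```
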
>> 
>> As an alternative, I considered removing stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change makes it possible to simplify (and speed up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hash code can now always be done on the fast path, provided 
>> the hash code has been installed. Fast-locked headers can be used directly; 
>> for monitor-locked objects we can easily reach through to the displaced 
>> header. This is safe because Java threads participate in the monitor 
>> deflation protocol. This would be implemented in a separate PR.
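
The hash-code fast path mentioned in the second bullet could look roughly like this (illustrative bit layout only; the real mark-word encoding differs):

```java
import java.util.function.LongSupplier;

// Illustrative sketch of always-fast hash lookup (not HotSpot code).
// Assumed low-bit tags: 01 = unlocked, 00 = fast-locked, 10 = monitor.
final class HashLookupSketch {
    static final int UNLOCKED = 0b01, FAST_LOCKED = 0b00, MONITOR = 0b10;
    static final int HASH_SHIFT = 8; // assumed position of the hash bits

    // Fast-locked headers keep the hash bits in place, so only
    // monitor-locked objects need the reach-through to the displaced
    // header stored in the ObjectMonitor.
    static long hashOf(long header, LongSupplier displacedHeader) {
        switch ((int) (header & 0b11)) {
            case UNLOCKED:
            case FAST_LOCKED:
                return header >>> HASH_SHIFT;
            case MONITOR:
                return displacedHeader.getAsLong() >>> HASH_SHIFT;
            default:
                throw new IllegalStateException("unknown tag");
        }
    }
}
```
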
>> 
>> ### Benchmarks
>> 
>> All benchmarks are run on server-class metal machines. The JVM settings are 
>> always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, 
>> less is better.
>> 
>> #### DaCapo/AArch64
>> 
>> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an 
>> AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 
>> 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to 
>> exclude cassandra, h2o & kafka benchmarks because of incompatibility with 
>> JDK20. Benchmarks that showed results far off the baseline or showed high 
>> variance have been repeated and I am reporting results with the most bias 
>> *against* fast-locking. The sunflow benchmark is really far off the mark - 
>> the baseline run with stack-locking exhibited very high run-to-run variance 
>> and generally much worse performance, while with fast-locking the variance 
>> was very low and the results very stable between runs. I wouldn't trust that 
>> benchmark - I mean what is it actually doing that a change in locking shows 
>> >30% perf difference?
>> 
>> benchmark | baseline | fast-locking | % | size
>> -- | -- | -- | -- | --
>> avrora | 27859 | 27563 | 1.07% | large
>> batik | 20786 | 20847 | -0.29% | large
>> biojava | 27421 | 27334 | 0.32% | default
>> eclipse | 59918 | 60522 | -1.00% | large
>> fop | 3670 | 3678 | -0.22% | default
>> graphchi | 2088 | 2060 | 1.36% | default
>> h2 | 297391 | 291292 | 2.09% | huge
>> jme | 8762 | 8877 | -1.30% | default
>> jython | 18938 | 18878 | 0.32% | default
>> luindex | 1339 | 1325 | 1.06% | default
>> lusearch | 918 | 936 | -1.92% | default
>> pmd | 58291 | 58423 | -0.23% | large
>> sunflow | 32617 | 24961 | 30.67% | large
>> tomcat | 25481 | 25992 | -1.97% | large
>> tradebeans | 314640 | 311706 | 0.94% | huge
>> tradesoap | 107473 | 110246 | -2.52% | huge
>> xalan | 6047 | 5882 | 2.81% | default
>> zxing | 970 | 926 | 4.75% | default
>> 
>> #### DaCapo/x86_64
>> 
>> The following measurements have been taken on an Intel Xeon Scalable 
>> Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same 
>> settings and considerations as in the measurements above.
>> 
>> benchmark | baseline | fast-Locking | % | size
>> -- | -- | -- | -- | --
>> avrora | 127690 | 126749 | 0.74% | large
>> batik | 12736 | 12641 | 0.75% | large
>> biojava | 15423 | 15404 | 0.12% | default
>> eclipse | 41174 | 41498 | -0.78% | large
>> fop | 2184 | 2172 | 0.55% | default
>> graphchi | 1579 | 1560 | 1.22% | default
>> h2 | 227614 | 230040 | -1.05% | huge
>> jme | 8591 | 8398 | 2.30% | default
>> jython | 13473 | 13356 | 0.88% | default
>> luindex | 824 | 813 | 1.35% | default
>> lusearch | 962 | 968 | -0.62% | default
>> pmd | 40827 | 39654 | 2.96% | large
>> sunflow | 53362 | 43475 | 22.74% | large
>> tomcat | 27549 | 28029 | -1.71% | large
>> tradebeans | 190757 | 190994 | -0.12% | huge
>> tradesoap | 68099 | 67934 | 0.24% | huge
>> xalan | 7969 | 8178 | -2.56% | default
>> zxing | 1176 | 1148 | 2.44% | default
>> 
>> #### Renaissance/AArch64
>> 
>> This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, 
>> with same JVM settings.
>> 
>> benchmark | baseline | fast-locking | %
>> -- | -- | -- | --
>> AkkaUct | 2558.832 | 2513.594 | 1.80%
>> Reactors | 14715.626 | 14311.246 | 2.83%
>> Als | 1851.485 | 1869.622 | -0.97%
>> ChiSquare | 1007.788 | 1003.165 | 0.46%
>> GaussMix | 1157.491 | 1149.969 | 0.65%
>> LogRegression | 717.772 | 733.576 | -2.15%
>> MovieLens | 7916.181 | 8002.226 | -1.08%
>> NaiveBayes | 395.296 | 386.611 | 2.25%
>> PageRank | 4294.939 | 4346.333 | -1.18%
>> FjKmeans | 496.076 | 493.873 | 0.45%
>> FutureGenetic | 2578.504 | 2589.255 | -0.42%
>> Mnemonics | 4898.886 | 4903.689 | -0.10%
>> ParMnemonics | 4260.507 | 4210.121 | 1.20%
>> Scrabble | 139.37 | 138.312 | 0.76%
>> RxScrabble | 320.114 | 322.651 | -0.79%
>> Dotty | 1056.543 | 1068.492 | -1.12%
>> ScalaDoku | 3443.117 | 3449.477 | -0.18%
>> ScalaKmeans | 259.384 | 258.648 | 0.28%
>> Philosophers | 24333.311 | 23438.22 | 3.82%
>> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14%
>> FinagleChirper | 6814.192 | 6853.38 | -0.57%
>> FinagleHttp | 4762.902 | 4807.564 | -0.93%
>> 
>> #### Renaissance/x86_64
>> 
>> benchmark | baseline | fast-locking | %
>> -- | -- | -- | --
>> AkkaUct | 1117.185 | 1116.425 | 0.07%
>> Reactors | 11561.354 | 11812.499 | -2.13%
>> Als | 1580.838 | 1575.318 | 0.35%
>> ChiSquare | 459.601 | 467.109 | -1.61%
>> GaussMix | 705.944 | 685.595 | 2.97%
>> LogRegression | 659.944 | 656.428 | 0.54%
>> MovieLens | 7434.303 | 7592.271 | -2.08%
>> NaiveBayes | 413.482 | 417.369 | -0.93%
>> PageRank | 3259.233 | 3276.589 | -0.53%
>> FjKmeans | 946.429 | 938.991 | 0.79%
>> FutureGenetic | 1760.672 | 1815.272 | -3.01%
>> ParMnemonics | 2016.917 | 2033.101 | -0.80%
>> Scrabble | 147.996 | 150.084 | -1.39%
>> RxScrabble | 177.755 | 177.956 | -0.11%
>> Dotty | 673.754 | 683.919 | -1.49%
>> ScalaDoku | 2193.562 | 1958.419 | 12.01%
>> ScalaKmeans | 165.376 | 168.925 | -2.10%
>> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95%
>> Philosophers | 14268.449 | 13308.87 | 7.21%
>> FinagleChirper | 4722.13 | 4688.3 | 0.72%
>> FinagleHttp | 3497.241 | 3605.118 | -2.99%
>> 
>> Some renaissance benchmarks are missing: DecTree, DbShootout and 
>> Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show 
>> very high run-to-run variance, which I am investigating (and probably 
>> addressing by running them much more often).
>> 
>> I have also run another benchmark, which is a popular Java JVM benchmark, 
>> with workloads wrapped in JMH and very slightly modified to run with newer 
>> JDKs, but I won't publish the results because I am not sure about the 
>> licensing terms. They look similar to the measurements above (i.e. +/- 2%, 
>> nothing very suspicious).
>> 
>> Please let me know if you want me to run any other workloads, or, even 
>> better, run them yourself and report here.
>> 
>> ### Testing
>>  - [x] tier1 (x86_64, aarch64, x86_32)
>>  - [x] tier2 (x86_64, aarch64)
>>  - [x] tier3 (x86_64, aarch64)
>>  - [x] tier4 (x86_64, aarch64)
>>  - [x] jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64)
>
> Roman Kennke has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 37 commits:
> 
>  - Merge remote-tracking branch 'upstream/master' into fast-locking
>  - Merge remote-tracking branch 'upstream/master' into fast-locking
>  - Merge remote-tracking branch 'upstream/master' into fast-locking
>  - More RISC-V fixes
>  - Merge remote-tracking branch 'origin/fast-locking' into fast-locking
>  - RISC-V port
>  - Revert "Re-use r0 in call to unlock_object()"
>    
>    This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46.
>  - Merge remote-tracking branch 'origin/fast-locking' into fast-locking
>  - Fix number of rt args to complete_monitor_locking_C, remove some comments
>  - Re-use r0 in call to unlock_object()
>  - ... and 27 more: https://git.openjdk.org/jdk/compare/4b89fce0...3f0acba4

FTR I agree with Holmes that a conditional opt-in is better.  While we are 
uncertain of the viability of the new scheme (which FTR I like!) for all our 
customers, we need dynamic selection of the technique, so we can turn it on 
and off.

Off by default at first, then later on by default, then on with no option at 
all if all goes well (which I hope it does).  Perhaps Lilliput can have it 
turned on by default, and throw an error if (for some reason) the user tries to 
turn it off again.

That's the way we phased in, and then phased out, biased locking, and it seems 
to me that this is a closely similar situation.  Eventually, if all goes well, 
we can remove the stack locking code, as we did with biased locking.

For more details about that long-running saga, one might look at the history of 
`-XX:+UseBiasedLocking` in the source base, perhaps starting with `$ git log -S 
UseBiasedLocking`.

-------------

PR: https://git.openjdk.org/jdk/pull/10590
