On Mon, 15 Jul 2024 11:33:30 GMT, Jorn Vernee <jver...@openjdk.org> wrote:
>> This PR limits the number of cases in which we deoptimize frames when >> closing a shared Arena. The initial intent of this was to improve the >> performance of shared arena closure in cases where a lot of threads are >> accessing and closing shared arenas at the same time (see attached >> benchmark), but unfortunately even disabling deoptimization altogether does >> not have a great effect on that benchmark. >> >> Nevertheless, I think the extra logging/testing/benchmark code, and comments >> I've written, together with reducing the number of cases where we deoptimize >> (which makes it clearer exactly why we need to deoptimize in the first >> place), will be useful going forward. So, I've a create this PR out of them. >> >> In this PR: >> - Deoptimizing is now only done in cases where it's needed, instead of >> always. Which is in cases where we are not inside an `@Scoped` method, but >> are inside a compiled frame that has a scoped access somewhere inside of it. >> - I've separated the stack walking code (`for_scope_method`) from the code >> that checks for a reference to the arena being closed >> (`is_accessing_session`), and added logging code to the former. That also >> required changing vframe code to accept an `ouputStream*` rather than always >> printing to `tty`. >> - Added a new test (`TestConcurrentClose`), that tries to close many shared >> arenas at the same time, in order to stress that use case. >> - Added a new benchmark (`ConcurrentClose`), that stresses the cases where >> many threads are accessing and closing shared arenas. >> >> I've done several benchmark runs with different amounts of threads. The >> confined case stays much more consistent, while the shared cases balloons up >> in time spent quickly when there are more than 4 threads: >> >> >> Benchmark Threads Mode Cnt Score Error Units >> ConcurrentClose.sharedAccess 32 avgt 10 9017.397 ± 202.870 us/op >> ConcurrentClose.sharedAccess 24 avgt 10 5178.214 ± 164.922 us/op >> ConcurrentClose.sharedAccess 16 avgt 10 2224.420 ± 165.754 us/op >> ConcurrentClose.sharedAccess 8 avgt 10 593.828 ± 8.321 us/op >> ConcurrentClose.sharedAccess 7 avgt 10 470.700 ± 22.511 us/op >> ConcurrentClose.sharedAccess 6 avgt 10 386.697 ± 59.170 us/op >> ConcurrentClose.sharedAccess 5 avgt 10 291.157 ± 7.023 us/op >> ConcurrentClose.sharedAccess 4 avgt 10 209.178 ± 5.802 us/op >> ConcurrentClose.sharedAccess 1 avgt 10 ... > > Jorn Vernee has updated the pull request incrementally with one additional > commit since the last revision: > > improve benchmark I've update the benchmark to run with 3 separate threads: 1 thread that is just creating and closing shared arenas in a loop, 1 that is accessing memory using the FFM API, and 1 that is accessing a `byte[]`. Current: Benchmark Mode Cnt Score Error Units ConcurrentClose.sharedClose avgt 10 50.093 ± 6.200 us/op ConcurrentClose.sharedClose:closing avgt 10 46.269 ± 0.786 us/op ConcurrentClose.sharedClose:memorySegmentAccess avgt 10 98.072 ± 19.061 us/op ConcurrentClose.sharedClose:otherAccess avgt 10 5.938 ± 0.058 us/op I do see a pretty big difference on the memory segment accessing thread when I remove deoptimization altogether: Benchmark Mode Cnt Score Error Units ConcurrentClose.sharedClose avgt 10 22.664 ± 0.409 us/op ConcurrentClose.sharedClose:closing avgt 10 45.351 ± 1.554 us/op ConcurrentClose.sharedClose:memorySegmentAccess avgt 10 16.671 ± 0.251 us/op ConcurrentClose.sharedClose:otherAccess avgt 10 5.969 ± 0.089 us/op When I remove the `has_scoped_access()` check before the deopt, I expect the `otherAccess` thread to be affected, but the effect isn't nearly as big as with the FFM thread. I think this is likely due to the `otherAccess` benchmark being less sensitive to optimization (i.e. it already runs fairly fast in the interpreter). I also tried using `MethodHandles::arrayElementGetter` for the access, but the numbers I got were pretty much the same: Benchmark Mode Cnt Score Error Units ConcurrentClose.sharedClose avgt 10 52.745 ± 1.071 us/op ConcurrentClose.sharedClose:closing avgt 10 46.670 ± 0.453 us/op ConcurrentClose.sharedClose:memorySegmentAccess avgt 10 102.663 ± 3.430 us/op ConcurrentClose.sharedClose:otherAccess avgt 10 8.901 ± 0.109 us/op I think, to really test the effect of the `has_scoped_access` check, we need to look at a more realistic scenario. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20158#issuecomment-2228311368