On Mon, 15 Jul 2024 11:33:30 GMT, Jorn Vernee <jver...@openjdk.org> wrote:

>> This PR limits the number of cases in which we deoptimize frames when 
>> closing a shared Arena. The initial intent of this was to improve the 
>> performance of shared arena closure in cases where a lot of threads are 
>> accessing and closing shared arenas at the same time (see attached 
>> benchmark), but unfortunately even disabling deoptimization altogether does 
>> not have a great effect on that benchmark.
>> 
>> Nevertheless, I think the extra logging/testing/benchmark code, and comments 
>> I've written, together with reducing the number of cases where we deoptimize 
>> (which makes it clearer exactly why we need to deoptimize in the first 
>> place), will be useful going forward. So, I've a create this PR out of them.
>> 
>> In this PR:
>> - Deoptimizing is now only done in cases where it's needed, instead of 
>> always. Which is in cases where we are not inside an `@Scoped` method, but 
>> are inside a compiled frame that has a scoped access somewhere inside of it.
>> - I've separated the stack walking code (`for_scope_method`) from the code 
>> that checks for a reference to the arena being closed 
>> (`is_accessing_session`), and added logging code to the former. That also 
>> required changing vframe code to accept an `ouputStream*` rather than always 
>> printing to `tty`.
>> - Added a new test (`TestConcurrentClose`), that tries to close many shared 
>> arenas at the same time, in order to stress that use case.
>> - Added a new benchmark (`ConcurrentClose`), that stresses the cases where 
>> many threads are accessing and closing shared arenas.
>> 
>> I've done several benchmark runs with different amounts of threads. The 
>> confined case stays much more consistent, while the shared cases balloons up 
>> in time spent quickly when there are more than 4 threads:
>> 
>> 
>> Benchmark                     Threads   Mode  Cnt     Score     Error  Units
>> ConcurrentClose.sharedAccess       32   avgt   10  9017.397 ± 202.870  us/op
>> ConcurrentClose.sharedAccess       24   avgt   10  5178.214 ± 164.922  us/op
>> ConcurrentClose.sharedAccess       16   avgt   10  2224.420 ± 165.754  us/op
>> ConcurrentClose.sharedAccess        8   avgt   10   593.828 ±   8.321  us/op
>> ConcurrentClose.sharedAccess        7   avgt   10   470.700 ±  22.511  us/op
>> ConcurrentClose.sharedAccess        6   avgt   10   386.697 ±  59.170  us/op
>> ConcurrentClose.sharedAccess        5   avgt   10   291.157 ±   7.023  us/op
>> ConcurrentClose.sharedAccess        4   avgt   10   209.178 ±   5.802  us/op
>> ConcurrentClose.sharedAccess        1   avgt   10  ...
>
> Jorn Vernee has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   improve benchmark

I've update the benchmark to run with 3 separate threads: 1 thread that is just 
creating and closing shared arenas in a loop, 1 that is accessing memory using 
the FFM API, and 1 that is accessing a `byte[]`.

Current:

Benchmark                                        Mode  Cnt   Score    Error  
Units
ConcurrentClose.sharedClose                      avgt   10  50.093 ±  6.200  
us/op
ConcurrentClose.sharedClose:closing              avgt   10  46.269 ±  0.786  
us/op
ConcurrentClose.sharedClose:memorySegmentAccess  avgt   10  98.072 ± 19.061  
us/op
ConcurrentClose.sharedClose:otherAccess          avgt   10   5.938 ±  0.058  
us/op


I do see a pretty big difference on the memory segment accessing thread when I 
remove deoptimization altogether:


Benchmark                                        Mode  Cnt   Score   Error  
Units
ConcurrentClose.sharedClose                      avgt   10  22.664 ± 0.409  
us/op
ConcurrentClose.sharedClose:closing              avgt   10  45.351 ± 1.554  
us/op
ConcurrentClose.sharedClose:memorySegmentAccess  avgt   10  16.671 ± 0.251  
us/op
ConcurrentClose.sharedClose:otherAccess          avgt   10   5.969 ± 0.089  
us/op


When I remove the `has_scoped_access()` check before the deopt, I expect the 
`otherAccess` thread to be affected, but the effect isn't nearly as big as with 
the FFM thread. I think this is likely due to the `otherAccess` benchmark being 
less sensitive to optimization (i.e. it already runs fairly fast in the 
interpreter). I also tried using `MethodHandles::arrayElementGetter` for the 
access, but the numbers I got were pretty much the same:


Benchmark                                        Mode  Cnt    Score   Error  
Units
ConcurrentClose.sharedClose                      avgt   10   52.745 ± 1.071  
us/op
ConcurrentClose.sharedClose:closing              avgt   10   46.670 ± 0.453  
us/op
ConcurrentClose.sharedClose:memorySegmentAccess  avgt   10  102.663 ± 3.430  
us/op
ConcurrentClose.sharedClose:otherAccess          avgt   10    8.901 ± 0.109  
us/op


I think, to really test the effect of the `has_scoped_access` check, we need to 
look at a more realistic scenario.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20158#issuecomment-2228311368

Reply via email to