On Fri, 12 Jul 2024 20:59:26 GMT, Jorn Vernee <jver...@openjdk.org> wrote:

>> This PR limits the number of cases in which we deoptimize frames when 
>> closing a shared Arena. The initial intent of this was to improve the 
>> performance of shared arena closure in cases where a lot of threads are 
>> accessing and closing shared arenas at the same time (see attached 
>> benchmark), but unfortunately even disabling deoptimization altogether does 
>> not have a great effect on that benchmark.
>> 
>> Nevertheless, I think the extra logging/testing/benchmark code, and comments 
>> I've written, together with reducing the number of cases where we deoptimize 
>> (which makes it clearer exactly why we need to deoptimize in the first 
>> place), will be useful going forward. So, I've a create this PR out of them.
>> 
>> In this PR:
>> - Deoptimizing is now only done in cases where it's needed, instead of 
>> always. Which is in cases where we are not inside an `@Scoped` method, but 
>> are inside compiled code.
>> - I've separated the stack walking code (`for_scope_method`) from the code 
>> that checks for a reference to the arena being closed 
>> (`is_accessing_session`), and added logging code to the former. That also 
>> required changing vframe code to accept an `ouputStream*` rather than always 
>> printing to `tty`.
>> - Added a new test (`TestConcurrentClose`), that tries to close many shared 
>> arenas at the same time, in order to stress that use case.
>> - Added a new benchmark (`ConcurrentClose`), that stresses the cases where 
>> many threads are accessing and closing shared arenas.
>> 
>> I've done several benchmark runs with different amounts of threads. The 
>> confined case stays much more consistent, while the shared cases balloons up 
>> in time spent quickly when there are more than 4 threads:
>> 
>> 
>> Benchmark                     Threads   Mode  Cnt     Score     Error  Units
>> ConcurrentClose.sharedAccess       32   avgt   10  9017.397 ± 202.870  us/op
>> ConcurrentClose.sharedAccess       24   avgt   10  5178.214 ± 164.922  us/op
>> ConcurrentClose.sharedAccess       16   avgt   10  2224.420 ± 165.754  us/op
>> ConcurrentClose.sharedAccess        8   avgt   10   593.828 ±   8.321  us/op
>> ConcurrentClose.sharedAccess        7   avgt   10   470.700 ±  22.511  us/op
>> ConcurrentClose.sharedAccess        6   avgt   10   386.697 ±  59.170  us/op
>> ConcurrentClose.sharedAccess        5   avgt   10   291.157 ±   7.023  us/op
>> ConcurrentClose.sharedAccess        4   avgt   10   209.178 ±   5.802  us/op
>> ConcurrentClose.sharedAccess        1   avgt   10    52.042 ±   0.630  us/op
>> ConcurrentClose.confine...
>
> Jorn Vernee has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   track has_scoped_access for compiled methods

Nice work !

Thinking a bit about how to improve the benchmark and given the semantics of 
Arena.close(), there is a trick that you can use. There are two kinds of memory 
segments, the one that only visible from Java and the one that are visible not 
only from Java. By example, a memory segment created from a mmap or a memory 
segment with an address sent to a C code are visible from outside Java, for 
those, you have no choice but wait in Arena.close() until all threads have 
answered to the handshakes. For all the other memory segments, because they are 
only visible from Java, their memory can be reclaimed asynchronously, i.e. the 
last thread of the handshakes can free the corresponding memory segments, so 
the thread that call Arena.close() is free to run even if the memory is not yet 
reclaimed.

>From my armchair, that seems a awful lot of engeneering so it may not worth 
>it, but now you know :)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20158#issuecomment-2226858328

Reply via email to