On Fri, 12 Jul 2024 20:59:26 GMT, Jorn Vernee <jver...@openjdk.org> wrote:

>> This PR limits the number of cases in which we deoptimize frames when 
>> closing a shared Arena. The initial intent of this was to improve the 
>> performance of shared arena closure in cases where a lot of threads are 
>> accessing and closing shared arenas at the same time (see attached 
>> benchmark), but unfortunately even disabling deoptimization altogether does 
>> not have a great effect on that benchmark.
>> 
>> Nevertheless, I think the extra logging/testing/benchmark code, and comments 
>> I've written, together with reducing the number of cases where we deoptimize 
>> (which makes it clearer exactly why we need to deoptimize in the first 
>> place), will be useful going forward. So, I've created this PR out of them.
>> 
>> In this PR:
>> - Deoptimizing is now only done in cases where it's needed, instead of 
>> always: namely, when we are not inside an `@Scoped` method, but are inside 
>> a compiled frame that has a scoped access somewhere inside of it.
>> - I've separated the stack walking code (`for_scope_method`) from the code 
>> that checks for a reference to the arena being closed 
>> (`is_accessing_session`), and added logging code to the former. That also 
>> required changing vframe code to accept an `outputStream*` rather than always 
>> printing to `tty`.
>> - Added a new test (`TestConcurrentClose`), that tries to close many shared 
>> arenas at the same time, in order to stress that use case.
>> - Added a new benchmark (`ConcurrentClose`), that stresses the cases where 
>> many threads are accessing and closing shared arenas.
>> 
>> I've done several benchmark runs with different amounts of threads. The 
>> confined case stays much more consistent, while the shared case quickly 
>> balloons in time spent when there are more than 4 threads:
>> 
>> 
>> Benchmark                     Threads   Mode  Cnt     Score     Error  Units
>> ConcurrentClose.sharedAccess       32   avgt   10  9017.397 ± 202.870  us/op
>> ConcurrentClose.sharedAccess       24   avgt   10  5178.214 ± 164.922  us/op
>> ConcurrentClose.sharedAccess       16   avgt   10  2224.420 ± 165.754  us/op
>> ConcurrentClose.sharedAccess        8   avgt   10   593.828 ±   8.321  us/op
>> ConcurrentClose.sharedAccess        7   avgt   10   470.700 ±  22.511  us/op
>> ConcurrentClose.sharedAccess        6   avgt   10   386.697 ±  59.170  us/op
>> ConcurrentClose.sharedAccess        5   avgt   10   291.157 ±   7.023  us/op
>> ConcurrentClose.sharedAccess        4   avgt   10   209.178 ±   5.802  us/op
>> ConcurrentClose.sharedAccess        1   avgt   10  ...
>
> Jorn Vernee has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   track has_scoped_access for compiled methods

Hi Jorn,

Many thanks for working on this! 

I have one problem with the benchmark: I think it is not measuring a setup 
that matches our workload. The basic problem is that we don't want to 
deoptimize threads that are unrelated to MemorySegments, so the throughput of 
those threads should not be affected. For threads currently in a 
memory-segment read, closing should have some effect, but they should recover 
fast.

The given benchmark only measures the following: it starts many threads; each 
opens a shared memory segment, does some work, and closes it. So it measures 
the throughput of the whole "create shared / work on it / close shared" 
workload. The problems we see in Lucene are different: we have many threads 
working on shared memory segments, or on other tasks not related to memory 
segments at all, while a few threads are concurrently closing and opening new 
arenas. With more threads concurrently closing arenas, the throughput of the 
other threads also degrades.

So IMHO, the benchmark should be improved to have a few (configurable) 
threads that open/close memory segments, a set of other threads that do 
unrelated work, and finally a set of threads reading from the memory segments 
opened by the first group. The test case you wrote is a better fit for that 
workload, so maybe the benchmark should be set up more like the test. A 
benchmark with that workload should show the improvement more clearly.
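To make the suggested split concrete, here is a plain-Java sketch (not JMH) 
of that three-group workload. All names, thread counts, and the 200 ms run 
duration are illustrative assumptions of mine, not the actual benchmark:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.concurrent.atomic.LongAdder;

public class MixedWorkloadSketch {
    // segment published by the closer thread; readers pick it up via this field
    static volatile MemorySegment current = MemorySegment.NULL;
    static volatile boolean running = true;

    public static void main(String[] args) throws Exception {
        LongAdder closes = new LongAdder();
        LongAdder reads = new LongAdder();
        LongAdder unrelated = new LongAdder();

        // group 1: opens/closes shared arenas in a loop
        Thread closer = Thread.ofPlatform().start(() -> {
            while (running) {
                Arena a = Arena.ofShared();
                MemorySegment s = a.allocate(ValueLayout.JAVA_INT);
                s.set(ValueLayout.JAVA_INT, 0, 1);
                current = s;
                a.close();           // triggers the thread-local handshake
                closes.increment();
            }
        });

        // group 2: reads from the published segment, tolerating closure
        Thread reader = Thread.ofPlatform().start(() -> {
            while (running) {
                try {
                    MemorySegment s = current;
                    if (s.byteSize() >= 4) {
                        reads.add(s.get(ValueLayout.JAVA_INT, 0));
                    }
                } catch (IllegalStateException e) {
                    // arena was closed between publish and access; expected
                }
            }
        });

        // group 3: unrelated work, no memory segments involved
        Thread worker = Thread.ofPlatform().start(() -> {
            while (running) {
                unrelated.increment();
            }
        });

        Thread.sleep(200);
        running = false;
        closer.join(); reader.join(); worker.join();
        // per-group throughput is what the benchmark should report separately
        System.out.println(closes.sum() > 0 && unrelated.sum() > 0 ? "ok" : "fail");
    }
}
```

The interesting metric would then be how much the group-3 throughput drops as 
the number of group-1 closer threads grows.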

The current benchmark has the problem that it measures the whole 
open/work/close cycle on shared segments. And closing a shared segment is 
always heavy, because it has to trigger and wait for the thread-local 
handshake.
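For illustration, the close semantics that make this expensive can be seen in 
a minimal single-threaded sketch (class name made up; this only shows the 
API behavior, not the handshake cost itself):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SharedCloseSketch {
    public static void main(String[] args) {
        Arena shared = Arena.ofShared();
        MemorySegment seg = shared.allocate(ValueLayout.JAVA_INT);
        seg.set(ValueLayout.JAVA_INT, 0, 42);
        System.out.println(seg.get(ValueLayout.JAVA_INT, 0)); // prints 42

        // For a shared arena, close() must ensure no other thread is still
        // inside an access to this segment before the memory is freed;
        // that is what the thread-local handshake is for.
        shared.close();

        try {
            seg.get(ValueLayout.JAVA_INT, 0); // segment is no longer alive
        } catch (IllegalStateException e) {
            System.out.println("closed");
        }
    }
}
```

Every access after close fails with `IllegalStateException`, on every thread, 
which is why the closing side has to pay the synchronization cost up front.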

Why is the test preventing inlining of the inner read method?

I may be able to benchmark a Lucene workload with a custom JDK build next 
week. It might be an idea to use the old (unfixed) DaCapo benchmark 
(downgraded to a version from before it fixed 
https://github.com/dacapobench/dacapobench/issues/264 , specifically 
https://github.com/dacapobench/dacapobench/commit/76588b28d516ae19f51a80e7287d404385a2c146).

Uwe

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20158#issuecomment-2227303884
