On Mon, 15 Jul 2024 14:02:27 GMT, Maurizio Cimadamore <mcimadam...@openjdk.org> 
wrote:

> So, that means that `arrayElementVarHandle` is ~4x faster than memory 
> segment? Isn't that a bit odd?

I did some more analysis of the benchmark. I first eliminated the closing 
thread and started with two simple benchmarks:


@Benchmark
public int memorySegmentAccess() {
    int sum = 0;
    for (int i = 0; i < segment.byteSize(); i++) {
        sum += segment.get(JAVA_BYTE, i);
    }
    return sum;
}


and


@Benchmark
public int otherAccess() {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        sum += (byte) BYTE_HANDLE.get(array, i);
    }
    return sum;
}


where the setup code is as follows:


static final int SIZE = 10_000;

MemorySegment segment;
byte[] array;

static final VarHandle BYTE_HANDLE =
        MethodHandles.arrayElementVarHandle(byte[].class);

@Setup
public void setup() {
    array = new byte[SIZE];
    segment = MemorySegment.ofArray(array);
}
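For context, the snippets above live in a JMH class along these lines (a sketch; only the class name `ConcurrentClose` comes from the JMH output below, the imports and annotations are my approximation):

import java.lang.foreign.MemorySegment;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

import static java.lang.foreign.ValueLayout.JAVA_BYTE;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class ConcurrentClose {
    // the fields, @Setup method and @Benchmark methods shown above go here
}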


With this, I obtained the following results:


Benchmark                            Mode  Cnt   Score   Error  Units
ConcurrentClose.memorySegmentAccess  avgt   10  13.879 ± 0.478  us/op
ConcurrentClose.otherAccess          avgt   10   2.256 ± 0.017  us/op


Ugh. It seems like C2 "blows up" at the third iteration:


# Run progress: 0.00% complete, ETA 00:05:00
# Fork: 1 of 1
# Warmup Iteration   1: 6.712 us/op
# Warmup Iteration   2: 5.756 us/op
# Warmup Iteration   3: 13.267 us/op
# Warmup Iteration   4: 13.267 us/op
# Warmup Iteration   5: 13.274 us/op


This might be a bug/regression, but let's move on. I then tweaked the 
induction variable of the memory segment loop to be `long` instead of `int`.
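The tweaked loop looks roughly like this (only the type of `i` changes):

@Benchmark
public int memorySegmentAccess() {
    int sum = 0;
    // long induction variable, matching the long-typed byteSize() bound
    for (long i = 0; i < segment.byteSize(); i++) {
        sum += segment.get(JAVA_BYTE, i);
    }
    return sum;
}

With that change, I got: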


Benchmark                            Mode  Cnt  Score   Error  Units
ConcurrentClose.memorySegmentAccess  avgt   10  2.764 ± 0.016  us/op
ConcurrentClose.otherAccess          avgt   10  2.240 ± 0.016  us/op


Far more respectable! Now we have a good baseline: both workloads take about 
the same amount of time, so we can use them to draw interesting comparisons. 
So, let's add back a thread that does a shared arena close.
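The closing thread runs in the same JMH group as the two accessor methods; the setup is along these lines (a sketch, assuming a `@Group`-based benchmark; the exact body of the closing loop may differ):

// the two accessor methods above also carry @Group("sharedClose"),
// so all three methods run concurrently on separate threads
@Benchmark
@Group("sharedClose")
public void closing() {
    // create and close a shared arena; closing a shared arena must
    // coordinate with any thread that might be accessing its segments
    Arena arena = Arena.ofShared();   // java.lang.foreign.Arena
    arena.allocate(JAVA_BYTE);
    arena.close();
}

With that thread added, the numbers look like this: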


Benchmark                                        Mode  Cnt   Score   Error  Units
ConcurrentClose.sharedClose                      avgt   10  12.001 ± 0.061  us/op
ConcurrentClose.sharedClose:closing              avgt   10  19.281 ± 0.323  us/op
ConcurrentClose.sharedClose:memorySegmentAccess  avgt   10   9.802 ± 0.314  us/op
ConcurrentClose.sharedClose:otherAccess          avgt   10   6.921 ± 0.151  us/op


This is with vanilla JDK. If I apply the changes in this PR, I get this:


Benchmark                                        Mode  Cnt   Score   Error  Units
ConcurrentClose.sharedClose                      avgt   10  10.837 ± 0.241  us/op
ConcurrentClose.sharedClose:closing              avgt   10  20.337 ± 1.674  us/op
ConcurrentClose.sharedClose:memorySegmentAccess  avgt   10   8.672 ± 0.993  us/op
ConcurrentClose.sharedClose:otherAccess          avgt   10   3.501 ± 0.162  us/op


This is good. Note how `otherAccess` improved almost 2x, as its code is no 
longer redundantly deoptimized. We also know that, even for memory segment 
access, we can avoid the redundant deopt once JDK-8290892 is fixed. To simulate 
that, I dropped the lines that apply the conservative deoptimization in 
`scopedMemoryAccess.cpp` and ran the benchmark again:


Benchmark                                        Mode  Cnt   Score   Error  Units
ConcurrentClose.sharedClose                      avgt   10   8.957 ± 0.089  us/op
ConcurrentClose.sharedClose:closing              avgt   10  18.898 ± 0.338  us/op
ConcurrentClose.sharedClose:memorySegmentAccess  avgt   10   4.403 ± 0.054  us/op
ConcurrentClose.sharedClose:otherAccess          avgt   10   3.571 ± 0.042  us/op


Ok, now both accessor threads seem faster.

If I swap the shared arena close for a confined arena close (sketched after the 
table below), I get this:


Benchmark                                        Mode  Cnt   Score    Error  Units
ConcurrentClose.sharedClose                      avgt   10   1.760 ±  0.008  us/op
ConcurrentClose.sharedClose:closing              avgt   10  ≈ 10⁻³           us/op
ConcurrentClose.sharedClose:memorySegmentAccess  avgt   10   2.912 ±  0.016  us/op
ConcurrentClose.sharedClose:otherAccess          avgt   10   2.367 ±  0.009  us/op


Summing up:
* there is some issue involving segment access with an `int` induction 
variable, which we should investigate separately
* this PR significantly improves the performance of threads that are not 
touching memory segments, even under heavy shared arena close loads
* the performance of unrelated memory segment access is still affected by 
concurrent shared arena close; this is due to the conservative deoptimization, 
which will be removed once JDK-8290892 is fixed
* once all fixes are applied, the performance of the accessing threads gets 
quite close to ideal, but not 100% there. The loss seems acceptable, given that 
this benchmark closes shared arenas in a loop, arguably the worst possible case.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20158#issuecomment-2228916752
