On Mon, 15 Jul 2024 16:30:11 GMT, Maurizio Cimadamore <mcimadam...@openjdk.org> wrote:
>>> > > Even with `arrayElementVarHandle` it's about the same
>>> >
>>> > This is very odd, and I don't have a good explanation as to why that is
>>> > the case. What does the baseline (confined arena) look like for
>>> > `arrayElementVarHandle` ?
>>>
>>> Pretty much exactly the same
>>
>> So, that means that `arrayElementVarHandle` is ~4x faster than memory
>> segment? Isn't that a bit odd?
>
>> So, that means that `arrayElementVarHandle` is ~4x faster than memory
>> segment? Isn't that a bit odd?
>
> I did some more analysis of the benchmark. I first eliminated the closing
> thread, and started with two simple benchmarks:
>
> @Benchmark
> public int memorySegmentAccess() {
>     int sum = 0;
>     for (int i = 0; i < segment.byteSize(); i++) {
>         sum += segment.get(JAVA_BYTE, i);
>     }
>     return sum;
> }
>
> and
>
> @Benchmark
> public int otherAccess() {
>     int sum = 0;
>     for (int i = 0; i < array.length; i++) {
>         sum += (byte) BYTE_HANDLE.get(array, i);
>     }
>     return sum;
> }
>
> where the setup code is as follows:
>
> static final int SIZE = 10_000;
>
> MemorySegment segment;
> byte[] array;
>
> static final VarHandle BYTE_HANDLE =
>         MethodHandles.arrayElementVarHandle(byte[].class);
>
> @Setup
> public void setup() {
>     array = new byte[SIZE];
>     segment = MemorySegment.ofArray(array);
> }
>
> With this, I obtained the following results:
>
> Benchmark                             Mode  Cnt   Score   Error  Units
> ConcurrentClose.memorySegmentAccess   avgt   10  13.879 ± 0.478  us/op
> ConcurrentClose.otherAccess           avgt   10   2.256 ± 0.017  us/op
>
> Ugh. It seems like C2 "blows up" at the third iteration:
>
> # Run progress: 0.00% complete, ETA 00:05:00
> # Fork: 1 of 1
> # Warmup Iteration   1: 6.712 us/op
> # Warmup Iteration   2: 5.756 us/op
> # Warmup Iteration   3: 13.267 us/op
> # Warmup Iteration   4: 13.267 us/op
> # Warmup Iteration   5: 13.274 us/op
>
> This might be a bug/regression. But, let's move on. I then tweaked the
> induction variable of the memory segment loop to be `long`, not `int`, and I
> got:
>
> Benchmark                             Mode  Cnt  Score   Error  Units
> ConcurrentClose.memorySegmentAccess   avgt   10  2.764 ± 0.016  us/op
> ConcurrentClose.otherAccess           avgt   10  2.240 ± 0.016  us/op
>
> Far more respectable! And now we have a good baseline, since both workloads
> take about the same time, so we can use them to draw interesting
> comparisons. So, let's add back a thread that does a shared arena close:
>
> Benchmark                                        Mode  Cnt   Score   Error  Units
> ConcurrentClose.sharedClose                      avgt   10  12.001 ± 0.061  us/op
> ConcurrentClose.sharedClose:closing              avgt   10  19.281 ± 0.323  us/op
> ConcurrentClose.sharedClose:memorySegmentAccess  avgt   10   9.802 ± 0.314  us/op
> ConcurrentClose.sharedClose:otherAccess          avgt    1...

Thanks @mcimadamore, this sounds great! I am so happy that we at least reduced the overhead for non-memory-segment threads. This will also be the case for Lucene/Solr, because we do not read from segments all the time; there is also other code that runs between reads from memory segments :-)

So +1 to merge this, and hopefully backport it at least to 21? That would be great, but as it is not a bug, it is not strictly necessary.

We should open issues for the `int` induction variable problem and work on JDK-8290892.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20158#issuecomment-2228926251
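P.S. For readers following the "int problem" mentioned above, here is a minimal standalone sketch (not from the PR; the class and method names are made up for illustration) of the two loop shapes being compared: an `int` induction variable tested against the `long` returned by `byteSize()`, versus the `long` induction variable used in the tweaked benchmark.

    import java.lang.foreign.MemorySegment;
    import static java.lang.foreign.ValueLayout.JAVA_BYTE;

    class InductionVariableSketch {

        // Slower shape in the analysis above: int induction variable,
        // compared on every iteration against the long returned by byteSize().
        static int sumWithIntIndex(MemorySegment segment) {
            int sum = 0;
            for (int i = 0; i < segment.byteSize(); i++) {
                sum += segment.get(JAVA_BYTE, i);
            }
            return sum;
        }

        // Faster shape: long induction variable, matching the long offset
        // expected by MemorySegment.get(OfByte, long).
        static int sumWithLongIndex(MemorySegment segment) {
            int sum = 0;
            for (long i = 0; i < segment.byteSize(); i++) {
                sum += segment.get(JAVA_BYTE, i);
            }
            return sum;
        }

        public static void main(String[] args) {
            // Same setup as the quoted benchmark: a 10_000-byte heap segment.
            MemorySegment segment = MemorySegment.ofArray(new byte[10_000]);
            System.out.println(sumWithIntIndex(segment) + " " + sumWithLongIndex(segment));
        }
    }

Both variants compute the same sum; only the type of the loop counter differs, which is what the quoted numbers suggest C2 is sensitive to.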