On Mon, 20 Jan 2025 16:20:20 GMT, Matthias Ernst <d...@openjdk.org> wrote:

>> Certain signatures for foreign function calls (e.g. HVA return by value) 
>> require allocation of an intermediate buffer to adapt the FFM's to the 
>> native stub's calling convention. In the current implementation, this buffer 
>> is malloced and freed on every FFM invocation, a non-negligible overhead.
>> 
>> Sample stack trace:
>> 
>>    java.lang.Thread.State: RUNNABLE
>>      at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native 
>> Method)
>> ...
>>      at 
>> jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>>      at 
>> jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown
>>  Source)
>> ...
>>      at 
>> java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder)
>> 
>> 
>> To alleviate this, this PR remembers and reuses up to two small intermediate 
>> buffers per carrier-thread in subsequent calls.
>> 
>> Performance (MBA M3):
>> 
>> 
>> master@764d70b7df18e288582e616c62b0d7078f1ff3aa
>> Benchmark                         Mode  Cnt    Score    Error  Units
>> PointsAlloc.circle_by_ptr         avgt   30    9.197 ?  0.037  ns/op
>> PointsAlloc.circle_by_value       avgt   30   42.195 ?  0.088  ns/op   <= 
>> #######
>> PointsAlloc.jni_ByteBuffer_alloc  avgt   30  226.127 ? 35.378  ns/op
>> PointsAlloc.jni_long_alloc        avgt   30   25.297 ?  2.457  ns/op
>> PointsAlloc.panama_alloc          avgt   30   27.053 ?  1.915  ns/op
>> 
>> After:
>> Benchmark                         Mode  Cnt    Score    Error  Units
>> PointsAlloc.circle_by_ptr         avgt   30    9.156 ?  0.021  ns/op
>> PointsAlloc.circle_by_value       avgt   30   11.995 ?  0.051  ns/op   <= 
>> #######
>> PointsAlloc.jni_ByteBuffer_alloc  avgt   30  211.161 ? 23.284  ns/op
>> PointsAlloc.jni_long_alloc        avgt   30   24.885 ?  2.461  ns/op
>> PointsAlloc.panama_alloc          avgt   30   26.905 ?  1.935  ns/op
>> 
>> 
>> `-prof gc` also shows that the new call path is fully scalar-replaced vs 160 
>> byte/call before.
>
> Matthias Ernst has updated the pull request incrementally with three 
> additional commits since the last revision:
> 
>  - shift api boundary
>  - move bench
>  - revert formatting

src/java.base/share/classes/jdk/internal/foreign/abi/SharedUtils.java line 393:

> 391:         MemorySegment unscoped = CallBufferCache.acquireOrAllocate(size);
> 392:         Arena scope = Arena.ofConfined();
> 393:         MemorySegment source = unscoped.reinterpret(scope, null);

I suggest passing the `scope` to `acquireOrAllocate` as well, so we only need a 
single call to `reinterpret` (it has an overload that takes both a size and 
arena. (And I think you can then also remove the `@SuppressWarnings` from this 
method)

src/java.base/share/classes/jdk/internal/foreign/abi/SharedUtils.java line 395:

> 393:         MemorySegment source = unscoped.reinterpret(scope, null);
> 394:         // Preferable we'd like to register this cleanup in the line 
> above
> 395:         // but it breaks scalar replacement.

I have a fix for this issue. Will create a PR.

test/micro/org/openjdk/bench/java/lang/foreign/CallOverheadByValue.java line 54:

> 52: @State(org.openjdk.jmh.annotations.Scope.Thread)
> 53: @OutputTimeUnit(TimeUnit.NANOSECONDS)
> 54: @Fork(value = 1, jvmArgs = {"-Xlog:gc", 
> "--enable-native-access=ALL-UNNAMED", "-Djava.library.path=micro/native"})

FWIW, unfortunately there is no builtin support to run profilers through 
`make`. I personally have a separate script that runs the benchmarks.jar:


./build/$CONF/images/jdk/bin/java -jar 
./build/$CONF/images/test/micro/benchmarks.jar -prof gc AllocTest.alloc_confined


(where `$CONF` is your build configuration)

test/micro/org/openjdk/bench/java/lang/foreign/CallOverheadByValue.java line 98:

> 96:                             (SegmentAllocator) (_, _) -> dest,
> 97:                             phi);
> 98:         }

Would it be viable to measure just a single invocation? That way we can remove 
other things like the `asSlice` call, which might introduce noise.

test/micro/org/openjdk/bench/java/lang/foreign/CallOverheadByValue.java line 99:

> 97:                             phi);
> 98:         }
> 99:         return points;

I don't think `points` needs to be returned here?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23142#discussion_r1922701981
PR Review Comment: https://git.openjdk.org/jdk/pull/23142#discussion_r1922701268
PR Review Comment: https://git.openjdk.org/jdk/pull/23142#discussion_r1922699935
PR Review Comment: https://git.openjdk.org/jdk/pull/23142#discussion_r1922700908
PR Review Comment: https://git.openjdk.org/jdk/pull/23142#discussion_r1922698930

Reply via email to