On Mon, 20 Jan 2025 07:30:17 GMT, Matthias Ernst <d...@openjdk.org> wrote:
>> Certain signatures for foreign function calls (e.g. HVA return by value) >> require allocation of an intermediate buffer to adapt the FFM's to the >> native stub's calling convention. In the current implementation, this buffer >> is malloced and freed on every FFM invocation, a non-negligible overhead. >> >> Sample stack trace: >> >> java.lang.Thread.State: RUNNABLE >> at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native >> Method) >> ... >> at >> jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386) >> at >> jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown >> Source) >> ... >> at >> java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder) >> >> >> To alleviate this, this PR remembers and reuses up to two small intermediate >> buffers per carrier-thread in subsequent calls. >> >> Performance (MBA M3): >> >> >> master@764d70b7df18e288582e616c62b0d7078f1ff3aa >> Benchmark Mode Cnt Score Error Units >> PointsAlloc.circle_by_ptr avgt 30 9.197 ? 0.037 ns/op >> PointsAlloc.circle_by_value avgt 30 42.195 ? 0.088 ns/op <= >> ####### >> PointsAlloc.jni_ByteBuffer_alloc avgt 30 226.127 ? 35.378 ns/op >> PointsAlloc.jni_long_alloc avgt 30 25.297 ? 2.457 ns/op >> PointsAlloc.panama_alloc avgt 30 27.053 ? 1.915 ns/op >> >> After: >> Benchmark Mode Cnt Score Error Units >> PointsAlloc.circle_by_ptr avgt 30 9.156 ? 0.021 ns/op >> PointsAlloc.circle_by_value avgt 30 11.995 ? 0.051 ns/op <= >> ####### >> PointsAlloc.jni_ByteBuffer_alloc avgt 30 211.161 ? 23.284 ns/op >> PointsAlloc.jni_long_alloc avgt 30 24.885 ? 2.461 ns/op >> PointsAlloc.panama_alloc avgt 30 26.905 ? 1.935 ns/op >> >> >> `-prof gc` also shows that the new call path is fully scalar-replaced vs 160 >> byte/call before. > > Matthias Ernst has updated the pull request incrementally with one additional > commit since the last revision: > > Implementation notes. Please also revert the unrelated formatting changes. src/java.base/share/classes/jdk/internal/foreign/abi/SharedUtils.java line 396: > 394: long address = fromCache != 0 ? fromCache : > CallBufferCache.allocate(bufferSize); > 395: return new > BoundedArena(MemorySegment.ofAddress(address).reinterpret(size)); > 396: } Looks like we're using `reinterpret` in the fallback case as well? I suggest structuring the code like this: Suggestion: final Arena arena = Arena.ofConfined(); MemorySegment segment = CallBufferCache.acquire(arena, size); if (segment == null) { segment = arena.allocate(size); } final SegmentAllocator slicingAllocator = SegmentAllocator.slicingAllocator(segment); return new Arena() { @Override public Scope scope() { return arena.scope(); } @Override public void close() { arena.close(); } @Override public MemorySegment allocate(long byteSize, long byteAlignment) { return slicingAllocator.allocate(byteSize, byteAlignment); } }; } i.e. encapsulate all the caching logic into the `CallBufferCache` class. Note that the 'release' action can be attached to `arena` as a cleanup action. test/micro/org/openjdk/bench/java/lang/foreign/points/PointsAlloc.java line 81: > 79: return Circle.byPtr(arena, NUM_CIRCLE_POINTS); > 80: } > 81: } I think it would be better to put the new benchmark in a separate benchmark class (e.g. a new variant of the `CallOverheadXXX` benchmark. There should already be some examples of benchmarks that pass structs), where the time unit is set to nanoseconds, so we can measure just the overhead of a single call. ------------- PR Review: https://git.openjdk.org/jdk/pull/23142#pullrequestreview-2562430579 PR Review Comment: https://git.openjdk.org/jdk/pull/23142#discussion_r1922426585 PR Review Comment: https://git.openjdk.org/jdk/pull/23142#discussion_r1922457699