Re: RFR: 8287788: reuse intermediate segments allocated during FFM stub invocations [v2]

Jorn Vernee Mon, 20 Jan 2025 06:13:11 -0800

On Mon, 20 Jan 2025 07:30:17 GMT, Matthias Ernst <d...@openjdk.org> wrote:


>> Certain signatures for foreign function calls (e.g. HVA return by value) 
>> require allocation of an intermediate buffer to adapt the FFM's to the 
>> native stub's calling convention. In the current implementation, this buffer 
>> is malloced and freed on every FFM invocation, a non-negligible overhead.
>> 
>> Sample stack trace:
>> 
>>    java.lang.Thread.State: RUNNABLE
>>      at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native 
>> Method)
>> ...
>>      at 
>> jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>>      at 
>> jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown
>>  Source)
>> ...
>>      at 
>> java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder)
>> 
>> 
>> To alleviate this, this PR remembers and reuses up to two small intermediate 
>> buffers per carrier-thread in subsequent calls.
>> 
>> Performance (MBA M3):
>> 
>> 
>> master@764d70b7df18e288582e616c62b0d7078f1ff3aa
>> Benchmark                         Mode  Cnt    Score    Error  Units
>> PointsAlloc.circle_by_ptr         avgt   30    9.197 ?  0.037  ns/op
>> PointsAlloc.circle_by_value       avgt   30   42.195 ?  0.088  ns/op   <= 
>> #######
>> PointsAlloc.jni_ByteBuffer_alloc  avgt   30  226.127 ? 35.378  ns/op
>> PointsAlloc.jni_long_alloc        avgt   30   25.297 ?  2.457  ns/op
>> PointsAlloc.panama_alloc          avgt   30   27.053 ?  1.915  ns/op
>> 
>> After:
>> Benchmark                         Mode  Cnt    Score    Error  Units
>> PointsAlloc.circle_by_ptr         avgt   30    9.156 ?  0.021  ns/op
>> PointsAlloc.circle_by_value       avgt   30   11.995 ?  0.051  ns/op   <= 
>> #######
>> PointsAlloc.jni_ByteBuffer_alloc  avgt   30  211.161 ? 23.284  ns/op
>> PointsAlloc.jni_long_alloc        avgt   30   24.885 ?  2.461  ns/op
>> PointsAlloc.panama_alloc          avgt   30   26.905 ?  1.935  ns/op
>> 
>> 
>> `-prof gc` also shows that the new call path is fully scalar-replaced vs 160 
>> byte/call before.
>
> Matthias Ernst has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Implementation notes.

Please also revert the unrelated formatting changes.

src/java.base/share/classes/jdk/internal/foreign/abi/SharedUtils.java line 396:

> 394:         long address = fromCache != 0 ? fromCache : 
> CallBufferCache.allocate(bufferSize);
> 395:         return new 
> BoundedArena(MemorySegment.ofAddress(address).reinterpret(size));
> 396:     }

Looks like we're using `reinterpret` in the fallback case as well? 

I suggest structuring the code like this:

Suggestion:

        final Arena arena = Arena.ofConfined();
        MemorySegment segment = CallBufferCache.acquire(arena, size);
        if (segment == null) {
            segment = arena.allocate(size);
        }
        final SegmentAllocator slicingAllocator = 
SegmentAllocator.slicingAllocator(segment);
        return new Arena() {
            @Override
            public Scope scope() {
                return arena.scope();
            }
            @Override
            public void close() {
                arena.close();
            }
            @Override
            public MemorySegment allocate(long byteSize, long byteAlignment) {
                return slicingAllocator.allocate(byteSize, byteAlignment);
            }
        };
    }


i.e. encapsulate all the caching logic into the `CallBufferCache` class. Note 
that the 'release' action can be attached to `arena` as a cleanup action.

test/micro/org/openjdk/bench/java/lang/foreign/points/PointsAlloc.java line 81:

> 79:             return Circle.byPtr(arena, NUM_CIRCLE_POINTS);
> 80:         }
> 81:     }

I think it would be better to put the new benchmark in a separate benchmark 
class (e.g. a new variant of the `CallOverheadXXX` benchmark. There should 
already be some examples of benchmarks that pass structs), where the time unit 
is set to nanoseconds, so we can measure just the overhead of a single call.

-------------

PR Review: https://git.openjdk.org/jdk/pull/23142#pullrequestreview-2562430579
PR Review Comment: https://git.openjdk.org/jdk/pull/23142#discussion_r1922426585
PR Review Comment: https://git.openjdk.org/jdk/pull/23142#discussion_r1922457699

Re: RFR: 8287788: reuse intermediate segments allocated during FFM stub invocations [v2]

Reply via email to