On Wed, 22 Jan 2025 09:57:15 GMT, Matthias Ernst <d...@openjdk.org> wrote:

>> Certain signatures for foreign function calls (e.g. HVA return by value) 
>> require allocation of an intermediate buffer to adapt the FFM's to the 
>> native stub's calling convention. In the current implementation, this buffer 
>> is malloced and freed on every FFM invocation, a non-negligible overhead.
>> 
>> Sample stack trace:
>> 
>>    java.lang.Thread.State: RUNNABLE
>>      at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
>> ...
>>      at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>>      at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
>> ...
>>      at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder)
>> 
>> 
>> To alleviate this, this PR remembers and reuses up to two small intermediate 
>> buffers per carrier-thread in subsequent calls.
>> 
>> Performance (MBA M3):
>> 
>> 
>> Before:
>> Benchmark                    Mode  Cnt   Score   Error  Units
>> CallOverheadByValue.byPtr    avgt   10   3.333 ± 0.152  ns/op
>> CallOverheadByValue.byValue  avgt   10  33.892 ± 0.034  ns/op
>> 
>> After:
>> Benchmark                    Mode  Cnt  Score   Error  Units
>> CallOverheadByValue.byPtr    avgt   10  3.291 ± 0.031  ns/op
>> CallOverheadByValue.byValue  avgt   10  5.464 ± 0.007  ns/op
>> 
>> 
>> `-prof gc` also shows that the new call path is fully scalar-replaced vs 160 
>> byte/call before.
>
> Matthias Ernst has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Back buffer allocation with a single carrier-local segment.

> just need a single buffer
> Alternatively we can use locking

I think these are really great suggestions, thank you!
They simplify things tremendously; I've pushed a version.
As you say, the errno / state capture piece can probably use it, too.

The extra atomics for acquiring/releasing don't seem to cost much, so this 
still has excellent performance (and is also alloc-free):

Benchmark                    Mode  Cnt  Score   Error  Units
CallOverheadByValue.byPtr    avgt   30  3.375 ± 0.138  ns/op
CallOverheadByValue.byValue  avgt   30  6.625 ± 0.057  ns/op
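
For readers following along, here is a minimal sketch of the acquire/release idea, not the code from the PR. Everything in it is an assumption made for illustration: the `BufferCache` name, the 256-byte size, a plain `ThreadLocal` standing in for the JDK-internal carrier-thread local, and `ByteBuffer.allocateDirect` standing in for the native segment. One cached buffer is guarded by an atomic flag; a caller that finds it busy, or needs more space, falls back to a fresh allocation.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a single cached buffer per thread, guarded by an atomic flag.
final class BufferCache {
    // Assumed size of the cached buffer; the real threshold may differ.
    static final int CACHED_SIZE = 256;

    // ThreadLocal is a stand-in for the JDK-internal carrier-thread local.
    private static final ThreadLocal<BufferCache> PER_THREAD =
            ThreadLocal.withInitial(BufferCache::new);

    private final ByteBuffer cached = ByteBuffer.allocateDirect(CACHED_SIZE);
    private final AtomicBoolean inUse = new AtomicBoolean(false);

    static BufferCache current() {
        return PER_THREAD.get();
    }

    // Returns the cached buffer if it fits and is free, else a fresh allocation.
    // The cached buffer may be larger than the requested size.
    ByteBuffer acquire(int size) {
        if (size <= CACHED_SIZE && inUse.compareAndSet(false, true)) {
            cached.clear();
            return cached;
        }
        return ByteBuffer.allocateDirect(size);
    }

    // Marks the cached buffer free again; fallback buffers are left to the GC.
    void release(ByteBuffer buffer) {
        if (buffer == cached) {
            inUse.set(false);
        }
    }
}
```

In this sketch a nested `acquire` while the cached buffer is held simply takes the fallback path, so callers never wait on each other.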


I'll leave this here for inspiration and add a few unit tests for the stack, 
but feel free to just close it in favor of related work.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23142#issuecomment-2606794554
