On Mon, 20 Jan 2025 18:43:54 GMT, Matthias Ernst <d...@openjdk.org> wrote:

>> Certain signatures for foreign function calls (e.g. HVA return by value) 
>> require allocation of an intermediate buffer to adapt the FFM's to the 
>> native stub's calling convention. In the current implementation, this buffer 
>> is malloced and freed on every FFM invocation, a non-negligible overhead.
>> 
>> Sample stack trace:
>> 
>>    java.lang.Thread.State: RUNNABLE
>>      at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
>> ...
>>      at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>>      at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
>> ...
>>      at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder)
>> 
>> 
>> To alleviate this, this PR remembers and reuses up to two small intermediate 
>> buffers per carrier-thread in subsequent calls.
>> 
>> Performance (MBA M3):
>> 
>> 
>> Before:
>> Benchmark                    Mode  Cnt   Score   Error  Units
>> CallOverheadByValue.byPtr    avgt   10   3.333 ± 0.152  ns/op
>> CallOverheadByValue.byValue  avgt   10  33.892 ± 0.034  ns/op
>> 
>> After:
>> Benchmark                    Mode  Cnt   Score   Error  Units
>> CallOverheadByValue.byPtr    avgt   10   3.291 ± 0.031  ns/op
>> CallOverheadByValue.byValue  avgt   10   5.464 ± 0.007  ns/op
>> 
>> 
>> `-prof gc` also shows that the new call path is fully scalar-replaced, versus 
>> 160 bytes/call before.
>
> Matthias Ernst has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   restore 3 forks

I talked to Maurizio offline, and we realized that if we pin the continuation 
when we acquire the buffer and unpin it when we release the buffer, we don't 
have to worry about buffers floating between threads between acquire and 
release. We could then also reuse the same buffer in consecutive calls (like a 
bump allocator), so we would need just a single buffer instead of a two-element 
cache, and we might be able to use it for more than 2 calls. Pinning the 
continuation shouldn't be a problem, since we're about to do a native call 
anyway, which will also pin it.
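
To make the idea concrete, here is a rough, simplified sketch of what a 
single-buffer bump allocator could look like. This is not the PR's code: the 
class name, the capacity, and the `pin()`/`unpin()` placeholders are all 
hypothetical, a plain `ThreadLocal` stands in for a carrier-thread-local, and 
alignment handling is omitted.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only; all names and sizes here are assumptions.
final class BumpBufferSketch {
    // Hypothetical capacity; the PR description only says "small" buffers.
    static final int CAPACITY = 256;

    // Stand-in for a carrier-thread-local cache (assumption: a plain
    // ThreadLocal is used here purely so the sketch is self-contained).
    private static final ThreadLocal<BumpBufferSketch> CACHE =
            ThreadLocal.withInitial(BumpBufferSketch::new);

    private final ByteBuffer buffer = ByteBuffer.allocateDirect(CAPACITY);
    private int top = 0; // bump pointer: number of bytes currently handed out

    static BumpBufferSketch current() {
        return CACHE.get();
    }

    // Placeholders for pinning/unpinning the continuation. The real
    // mechanism is JDK-internal; these no-ops only mark where it would go.
    private static void pin() { }
    private static void unpin() { }

    // Hands out the next `size` bytes, or null if the buffer is exhausted.
    // A real caller would fall back to a fresh allocation on null.
    ByteBuffer acquire(int size) {
        if (size < 0 || size > CAPACITY - top) {
            return null;
        }
        pin(); // from here on the buffer must not move to another thread
        ByteBuffer slice = buffer.slice(top, size);
        top += size;
        return slice;
    }

    // Releases the most recently acquired `size` bytes (strictly LIFO).
    void release(int size) {
        if (size < 0 || size > top) {
            throw new IllegalStateException("release without matching acquire");
        }
        top -= size;
        unpin();
    }

    int used() {
        return top;
    }
}
```

Because acquire and release are strictly paired around each call, releases 
happen in LIFO order, which is what lets a single bump pointer serve more than 
two outstanding acquisitions.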

We would need to wait until https://bugs.openjdk.org/browse/JDK-8347997 is 
fixed, which seems like a good idea either way, since it gives us more options 
when implementing this.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23142#issuecomment-2605284762
