On Wed, 22 Jan 2025 09:57:15 GMT, Matthias Ernst <d...@openjdk.org> wrote:
>> Certain signatures for foreign function calls (e.g. HVA return by value)
>> require allocation of an intermediate buffer to adapt the FFM's to the
>> native stub's calling convention. In the current implementation, this buffer
>> is malloced and freed on every FFM invocation, a non-negligible overhead.
>>
>> Sample stack trace:
>>
>>    java.lang.Thread.State: RUNNABLE
>>         at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
>>         ...
>>         at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>>         at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
>>         ...
>>         at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder)
>>
>> To alleviate this, this PR remembers and reuses up to two small intermediate
>> buffers per carrier-thread in subsequent calls.
>>
>> Performance (MBA M3):
>>
>> Before:
>> Benchmark                    Mode  Cnt   Score   Error  Units
>> CallOverheadByValue.byPtr    avgt   10   3.333 ± 0.152  ns/op
>> CallOverheadByValue.byValue  avgt   10  33.892 ± 0.034  ns/op
>>
>> After:
>> Benchmark                    Mode  Cnt   Score   Error  Units
>> CallOverheadByValue.byPtr    avgt   10   3.291 ± 0.031  ns/op
>> CallOverheadByValue.byValue  avgt   10   5.464 ± 0.007  ns/op
>>
>> `-prof gc` also shows that the new call path is fully scalar-replaced vs 160
>> byte/call before.
>
> Matthias Ernst has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Back buffer allocation with a single carrier-local segment.

> just need a single buffer

> Alternatively we can use locking

I think these are really really great suggestions, thank you! It simplifies things tremendously, I've pushed a version of it. As you say, the errno / state capture piece can probably just use it, too.
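For readers following along, the general shape of "one cached buffer, acquired and released with an atomic flag" can be sketched roughly as below. This is an illustration only, not the code in the PR: `BufferCache`, `Lease`, and `CACHED_SIZE` are hypothetical names, a plain `ThreadLocal` stands in for the JDK-internal carrier-thread local, and a direct `ByteBuffer` stands in for a `MemorySegment`.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; names and sizes are assumptions, not taken from the PR.
final class BufferCache {
    // Assumed upper bound for requests served from the cached buffer.
    static final int CACHED_SIZE = 256;

    // One cache per thread (stand-in for a carrier-thread local).
    private static final ThreadLocal<BufferCache> PER_THREAD =
            ThreadLocal.withInitial(BufferCache::new);

    private final ByteBuffer cached = ByteBuffer.allocateDirect(CACHED_SIZE);
    // Set while the cached buffer is handed out, e.g. during a nested call.
    private final AtomicBoolean inUse = new AtomicBoolean(false);

    /** A handed-out buffer; closing it returns the cached buffer to its owner. */
    static final class Lease implements AutoCloseable {
        private final ByteBuffer buffer;
        private final BufferCache owner; // null if the buffer was freshly allocated
        private boolean closed;

        private Lease(ByteBuffer buffer, BufferCache owner) {
            this.buffer = buffer;
            this.owner = owner;
        }

        ByteBuffer buffer() { return buffer; }

        boolean isCached() { return owner != null; }

        @Override
        public void close() {
            if (closed) return;      // make a repeated close() harmless
            closed = true;
            if (owner != null) {
                owner.inUse.set(false); // release the cached buffer for reuse
            }
        }
    }

    /**
     * Hands out the thread's cached buffer if the request fits and the buffer
     * is free; otherwise falls back to a fresh allocation. The cached buffer
     * may be larger than the requested size.
     */
    static Lease acquire(int size) {
        BufferCache cache = PER_THREAD.get();
        if (size <= CACHED_SIZE && cache.inUse.compareAndSet(false, true)) {
            cache.cached.clear(); // reset position and limit before reuse
            return new Lease(cache.cached, cache);
        }
        return new Lease(ByteBuffer.allocateDirect(size), null);
    }
}
```

A caller would wrap each use in try-with-resources, e.g. `try (BufferCache.Lease l = BufferCache.acquire(64)) { ... l.buffer() ... }`, so the fast path costs one compare-and-set on acquire and one volatile write on release, with no allocation.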
The extra atomics for acquiring/releasing don't seem to cost that much, so this still has excellent performance (and is also alloc-free):

Benchmark                    Mode  Cnt  Score   Error  Units
CallOverheadByValue.byPtr    avgt   30  3.375 ± 0.138  ns/op
CallOverheadByValue.byValue  avgt   30  6.625 ± 0.057  ns/op

I'll leave this here for inspiration. I'll add a few unit tests for the stack, but feel free to just close it in favor of related work.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23142#issuecomment-2606794554