Certain signatures of foreign function calls require an intermediate buffer to
adapt the FFM downcall's calling convention to that of the native stub
("needsReturnBuffer"). In the current implementation, this buffer is malloc'ed
and freed on every FFM invocation, a non-negligible overhead.
Sample stack trace:
  java.lang.Thread.State: RUNNABLE
    at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
    at jdk.internal.misc.Unsafe.allocateMemory(java.base@25-ea/Unsafe.java:636)
    at jdk.internal.foreign.SegmentFactories.allocateMemoryWrapper(java.base@25-ea/SegmentFactories.java:215)
    at jdk.internal.foreign.SegmentFactories.allocateSegment(java.base@25-ea/SegmentFactories.java:193)
    at jdk.internal.foreign.ArenaImpl.allocateNoInit(java.base@25-ea/ArenaImpl.java:55)
    at jdk.internal.foreign.ArenaImpl.allocate(java.base@25-ea/ArenaImpl.java:60)
    at jdk.internal.foreign.ArenaImpl.allocate(java.base@25-ea/ArenaImpl.java:34)
    at java.lang.foreign.SegmentAllocator.allocate(java.base@25-ea/SegmentAllocator.java:645)
    at jdk.internal.foreign.abi.SharedUtils$2.<init>(java.base@25-ea/SharedUtils.java:388)
    at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
    at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
    at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
    at java.lang.invoke.LambdaForm$MH/0x000001f00109a400.invoke(java.base@25-ea/LambdaForm$MH)
    at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder)
When does this happen? A fairly easy way to trigger this is by returning a
small aggregate by value, like the following:
struct Vector2D {
    double x, y;
};

Vector2D Origin() {
    return {0, 0};
}
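For reference, the Java side of such a call might look roughly like the sketch
below. This is not code from this PR; the class name, symbol lookup and library
loading are assumptions, and the sketch only serves to show that the by-value
struct return surfaces as a leading SegmentAllocator parameter on the downcall
handle.

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

class Vector2DDowncall {
    // struct Vector2D { double x, y; } described as an FFM layout
    static final StructLayout VECTOR2D = MemoryLayout.structLayout(
            ValueLayout.JAVA_DOUBLE.withName("x"),
            ValueLayout.JAVA_DOUBLE.withName("y"));

    public static void main(String[] args) throws Throwable {
        // assumption: a library exporting Origin() has already been loaded
        MemorySegment origin = SymbolLookup.loaderLookup().find("Origin").orElseThrow();

        // a by-value struct return makes the downcall handle take a leading
        // SegmentAllocator, which supplies the 16-byte result segment
        MethodHandle mh = Linker.nativeLinker()
                .downcallHandle(origin, FunctionDescriptor.of(VECTOR2D));

        try (Arena arena = Arena.ofConfined()) {
            MemorySegment v = (MemorySegment) mh.invokeExact((SegmentAllocator) arena);
            System.out.println(v.get(ValueLayout.JAVA_DOUBLE, 0) + ", "
                    + v.get(ValueLayout.JAVA_DOUBLE, 8));
        }
    }
}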
On AArch64, such a struct is returned in the two 128-bit registers v0/v1.
The VM's calling convention for the native stub consequently expects a 32-byte
output segment argument.
The FFM downcall method handle instead expects to create a 16-byte result
segment through the application-provided SegmentAllocator, and needs to perform
an appropriate adaptation, roughly like so:
MemorySegment downcallMH(SegmentAllocator a) {
    MemorySegment tmp = SharedUtils.allocate(32);
    try {
        nativeStub.invoke(tmp);                // leaves v0, v1 in tmp
        MemorySegment result = a.allocate(16);
        // compact the two 16-byte register slots into the 16-byte struct
        result.set(ValueLayout.JAVA_DOUBLE, 0, tmp.get(ValueLayout.JAVA_DOUBLE, 0));
        result.set(ValueLayout.JAVA_DOUBLE, 8, tmp.get(ValueLayout.JAVA_DOUBLE, 16));
        return result;
    } finally {
        free(tmp);
    }
}
You might argue that this cost is no worse than the allocation performed
through the result allocator anyway. However, the application has control over
that allocator, and may provide a segment-reusing allocator in a loop:
MemorySegment result = allocate(resultLayout);
SegmentAllocator allocator = (_, _) -> result;
for (...) {
    mh.invoke(allocator);   // <= would like to avoid hidden allocations in here
}
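Spelled out, the same reuse pattern can be written with the standard
SegmentAllocator.prefixAllocator. This is a hypothetical sketch, assuming the
mh handle and VECTOR2D layout from the earlier example and a surrounding method
that throws Throwable:

try (Arena arena = Arena.ofConfined()) {
    MemorySegment result = arena.allocate(VECTOR2D);
    // every allocation request is satisfied by slicing `result`, so the
    // application side allocates exactly once for the whole loop
    SegmentAllocator reusing = SegmentAllocator.prefixAllocator(result);
    for (int i = 0; i < 1_000_000; i++) {
        MemorySegment v = (MemorySegment) mh.invokeExact(reusing);
        // v aliases `result`; without this PR the hidden 32-byte return
        // buffer is still malloc'ed and freed inside every invocation
    }
}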
To alleviate this, this PR remembers and reuses a single such intermediate
buffer per carrier thread across subsequent calls, very similar to what happens
in sun.nio.ch.Util.BufferCache or sun.nio.fs.NativeBuffers, which face a
similar issue.
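The shape of the idea is roughly the following sketch. It is not the PR's
actual code: the class and method names are illustrative, the capacity is an
assumption, and the real change caches per carrier thread, stores raw segment
addresses, and mallocs/frees uncached buffers explicitly, whereas this sketch
leans on an automatic arena so dropped buffers are simply GC-reclaimed.

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

final class CallBufferCache {
    private static final long CAPACITY = 64;
    private static final ThreadLocal<MemorySegment> CACHE = new ThreadLocal<>();

    // hand out the cached buffer when it is free and big enough; otherwise allocate
    static MemorySegment acquire(long size) {
        MemorySegment cached = CACHE.get();
        if (cached != null && cached.byteSize() >= size) {
            CACHE.set(null);   // mark as in use for the duration of the call
            return cached;
        }
        return Arena.ofAuto().allocate(Math.max(size, CAPACITY));
    }

    // park the buffer for the next call on this thread; extra buffers are dropped
    static void release(MemorySegment buffer) {
        if (CACHE.get() == null && buffer.byteSize() >= CAPACITY) {
            CACHE.set(buffer);
        }
    }
}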
Performance (MBA M3):

Baseline:
# VM version: JDK 25-ea, OpenJDK 64-Bit Server VM, 25-ea+3-283
Benchmark                                         Mode  Cnt      Score     Error   Units
PointsAlloc.circle_by_ptr                         avgt    5      8.964 ±   0.351   ns/op
PointsAlloc.circle_by_ptr:·gc.alloc.rate          avgt    5     95.301 ±   3.665  MB/sec
PointsAlloc.circle_by_ptr:·gc.alloc.rate.norm     avgt    5      0.224 ±   0.001    B/op
PointsAlloc.circle_by_ptr:·gc.count               avgt    5      2.000            counts
PointsAlloc.circle_by_ptr:·gc.time                avgt    5      3.000                ms
PointsAlloc.circle_by_value                       avgt    5     46.498 ±   2.336   ns/op
PointsAlloc.circle_by_value:·gc.alloc.rate        avgt    5  13141.578 ± 650.425  MB/sec
PointsAlloc.circle_by_value:·gc.alloc.rate.norm   avgt    5    160.224 ±   0.001    B/op
PointsAlloc.circle_by_value:·gc.count             avgt    5    116.000            counts
PointsAlloc.circle_by_value:·gc.time              avgt    5     44.000                ms

Patched:
# VM version: JDK 25-internal, OpenJDK 64-Bit Server VM, 25-internal-adhoc.mernst.jdk
Benchmark                                         Mode  Cnt      Score     Error   Units
PointsAlloc.circle_by_ptr                         avgt    5      9.108 ±   0.477   ns/op
PointsAlloc.circle_by_ptr:·gc.alloc.rate          avgt    5     93.792 ±   4.898  MB/sec
PointsAlloc.circle_by_ptr:·gc.alloc.rate.norm     avgt    5      0.224 ±   0.001    B/op
PointsAlloc.circle_by_ptr:·gc.count               avgt    5      2.000            counts
PointsAlloc.circle_by_ptr:·gc.time                avgt    5      4.000                ms
PointsAlloc.circle_by_value                       avgt    5     13.180 ±   0.611   ns/op
PointsAlloc.circle_by_value:·gc.alloc.rate        avgt    5     64.816 ±   2.964  MB/sec
PointsAlloc.circle_by_value:·gc.alloc.rate.norm   avgt    5      0.224 ±   0.001    B/op
PointsAlloc.circle_by_value:·gc.count              avgt    5      2.000            counts
PointsAlloc.circle_by_value:·gc.time              avgt    5      5.000                ms
-------------
Commit messages:
- tiny stylistic changes
- Storing segment addresses instead of objects in the cache appears to be slightly faster. Write barrier?
- (c)
- unit test
- move CallBufferCache out
- shave off a couple more nanos
- Add comparison benchmark for out-parameter.
- copyright header
- Benchmark:
- move pinned cache lookup out of constructor.
- ... and 20 more: https://git.openjdk.org/jdk/compare/8460072f...4a2210df
Changes: https://git.openjdk.org/jdk/pull/23142/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23142&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8287788
Stats: 402 lines in 7 files changed: 377 ins; 0 del; 25 mod
Patch: https://git.openjdk.org/jdk/pull/23142.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23142/head:pull/23142
PR: https://git.openjdk.org/jdk/pull/23142