On Thu, 18 Jun 2026 15:22:37 GMT, Per Minborg <[email protected]> wrote:
>> ## Summary
>>
>> This PR proposes to introduce a pooled confined arena as an optimization for
>> `Arena.ofConfined()`, where small native allocations can be served from a
>> reusable per-thread/per-slot memory pool instead of calling the regular
>> native allocator for every short-lived arena. The arena remains confined to
>> its owner thread and is still closed normally, but its backing storage can
>> be reset and reused when the arena closes. The feature requires no API
>> changes.
>>
>> ### Outline
>>
>> Platform threads: one lazily allocated pool per Thread, encoded in
>> `Thread.confinedMemoryPool`.
>> Virtual threads: fixed shared native pool with CAS-protected slots, because
>> per-virtual-thread native pools would not scale.
>>
>> Pooled memory is zeroed out upon _closing_ an Arena to minimize data
>> visibility between reuse. This means the data is visible only within a TWR
>> block, and never outside it.
>>
>> By default, a confined arena has access to 64 bytes of pooled data. The
>> pool size is configurable via a system property and can be 8, 16, 32, or 64
>> bytes. Pooling can also be turned off completely by setting the pool
>> power-of-two size to zero. Nested confined arenas are not supported
>>
>> ## Static Analysis
>>
>> An extensive static corpus analysis of third-party libraries and the JDK
>> itself has been conducted with respect to `Area.ofConfined()` usage,
>> revealing that confined arenas were used _only_ in TWR blocks and _never_ in
>> an unstructured way. The static analysis further revealed that in most
>> cases, only a small amount of native memory was ever allocated, usually less
>> than 32 bytes, and in many cases, 8 bytes or less. This usage pattern lends
>> itself well to pooling.
>>
>> ## Dynamic Analysis
>>
>> A dynamic statistical analysis of actual runs was also made, where various
>> properties of confined arenas were recorded and summarized during a complete
>> tier1 test run. While a tier1 run is not necessarily representative of a
>> typical application workload, it provided some interesting results:
>>
>> The run produced 93 per-process histogram blocks and 788,773,092 closed
>> confined arenas. The result is dominated by arenas with no native allocation
>> at all: 375,934,768 arenas (47.661%) are in the zero-byte bucket. Counting
>> arenas up to 63 bytes covers 99.997% of all arena closures.
>>
>> The largest count bucket is 8-15 bytes per arena with 400,951,293 arenas
>> (50.832% of all arenas). The largest byte bucket is 8-15 bytes per arena
>> with 3,207,623,039 B (3,059.03 MiB) (46.794% of all by...
>
> Per Minborg has updated the pull request incrementally with one additional
> commit since the last revision:
>
> Improve performance for VTs
I've left some comments on the implementation inline.
Besides the discussion points that you've mentioned, I think there are a few
more points worth discussing:
1. The pool for virtual threads is shared, and has a fixed number of slots
based on the number of processors. This doesn't scale with the number of
virtual threads, which I think would be closer to ideal (more virtual threads,
more slots). I think the only reason for having a separate fixed pool for
virtual threads is to avoid the additional `long` field in the virtual thread
instance, right? If we could use the same implementation for both platform and
virtual threads that would make the overall implementation of the pool a lot
simpler.
2. The pool doesn't work for nested confined sessions that need to allocate.
Though the pool isn't acquired eagerly anymore, but only on the first
allocation, the pool can still only be used by the _first_ confined arena that
needs to allocate. This is a bit unfortunate, but I'm honestly not sure how
important the nested case is. Would it be possible to enhance the pool to
support multiple 64-byte slabs, so that several nested confined arenas could
benefit?
src/java.base/share/classes/java/lang/Thread.java line 302:
> 300: *
> 301: * PTRDIFF_MAX (usually 2<sup>63</sup>-1) allows us to use the
> sign bit
> 302: * as a flag for acquire state without conflicting with malloc()
> return values.
Why does `PTRDIFF_MAX` allow us to do this? I believe in the past we had
reasons to believe that addresses might use the sign bit on some systems (e.g.
for pointer tagging). `PTRDIFF_MAX` is also system dependent AFAICT.
I see we will never get false positives because you've added a check when
allocating the pool, but I'm not sure we can rely on this as an invariant.
src/java.base/share/classes/jdk/internal/foreign/ArenaImpl.java line 54:
> 52: }
> 53:
> 54: @ForceInline
Still looks like there are some unneeded `@ForceInline` annotations.
src/java.base/share/classes/jdk/internal/foreign/ArenaImpl.java line 96:
> 94: @ForceInline
> 95: NativeMemorySegmentImpl allocateLowLevel(long byteSize, long
> byteAlignment, boolean init) {
> 96: final long poolSize = ConfinedPoolHolder.POOL_SIZE;
Please remove the use of `final` on local variables.
src/java.base/share/classes/jdk/internal/foreign/ArenaImpl.java line 112:
> 110: if (pool > 0 && (segment = trySlice(pool,
> allocationByteSize, byteAlignment, poolSize)) != null) {
> 111: // Preserve the invariant that zero-sized segments
> have unique addresses
> 112: // for any given Arena
I think this comment should be on the `final long allocationByteSize =
Math.max(1, byteSize);` line?
src/java.base/share/classes/jdk/internal/foreign/ArenaImpl.java line 122:
> 120:
> 121: @ForceInline
> 122: private NativeMemorySegmentImpl trySlice(long pool, long
> byteSize, long byteAlignment, long poolSize) {
I think you could potentially still implement Muarizio's suggestion to delay
the creation of the segment instance. We could e.g. have an out-of-line
allocation path that returns `long`, and then some code that wraps the result
into a memory segment instance. Since no objects are passed over the inlining
boundary, this should still be fast, but saves the inlining budget for other
things in the enclosing code.
src/java.base/share/classes/jdk/internal/foreign/ConfinedSegmentPool.java line
59:
> 57:
> 58: // The following values can be observed {-1 (disabled), 8, 16, 32 or
> 64} bytes
> 59: static final long POOLED_MEMORY_SIZE =
> clampedPowerOfPropertyOr(POOLED_MEMORY_PROPERTY, 6);
Maybe consider making this `private` and using `pooledMemorySize()` everywhere,
or making this public instead.
src/java.base/share/classes/jdk/internal/foreign/ConfinedSegmentPool.java line
64:
> 62:
> 63: @ForceInline
> 64: public static boolean isVirtualPoolInitialized() {
Looks unused outside of the class?
Suggestion:
private static boolean isVirtualPoolInitialized() {
src/java.base/share/classes/jdk/internal/foreign/ConfinedSegmentPool.java line
142:
> 140:
> 141: @ForceInline
> 142: private static long acquirePlatform(Thread thread) {
I suggest adding an assert that `thread` is the current thread, to make it
clear that there's no race between the `get`/`set` below
Suggestion:
private static long acquirePlatform(Thread thread) {
assert thread == Thread.currentThread();
src/java.base/share/classes/jdk/internal/foreign/ConfinedSegmentPool.java line
254:
> 252:
> 253: // Sentinel value for no pooling.
> 254: // Selecting -1 rather than zero allows constant folding.
Constant folding is possible either way. The non-default restriction only
applies to `@Stable` variables.
src/java.base/share/classes/jdk/internal/foreign/ConfinedSegmentPool.java line
308:
> 306: zeroOutMemory(address, size);
> 307: if (!U.compareAndSetLong(null, ownerAddress, owner,
> RELEASED)) {
> 308: throw new IllegalStateException("Cannot release pooled
> memory: " + thread);
Should this also return `false` and let the caller throw the exception?
src/java.base/share/classes/jdk/internal/foreign/ConfinedSegmentPool.java line
369:
> 367: return target <= 1
> 368: ? 1
> 369: : Integer.highestOneBit(target - 1) << 1;
`availableProcessors` is guaranteed to return a number "never smaller than 1".
So this conditional isn't needed since `target` will always be at least 2.
Suggestion:
return Integer.highestOneBit(target - 1) << 1;
-------------
PR Review: https://git.openjdk.org/jdk/pull/31365#pullrequestreview-4525782103
PR Review Comment: https://git.openjdk.org/jdk/pull/31365#discussion_r3443266973
PR Review Comment: https://git.openjdk.org/jdk/pull/31365#discussion_r3443036856
PR Review Comment: https://git.openjdk.org/jdk/pull/31365#discussion_r3443067664
PR Review Comment: https://git.openjdk.org/jdk/pull/31365#discussion_r3443056776
PR Review Comment: https://git.openjdk.org/jdk/pull/31365#discussion_r3443045923
PR Review Comment: https://git.openjdk.org/jdk/pull/31365#discussion_r3443199398
PR Review Comment: https://git.openjdk.org/jdk/pull/31365#discussion_r3443188447
PR Review Comment: https://git.openjdk.org/jdk/pull/31365#discussion_r3443220293
PR Review Comment: https://git.openjdk.org/jdk/pull/31365#discussion_r3443318953
PR Review Comment: https://git.openjdk.org/jdk/pull/31365#discussion_r3443339616
PR Review Comment: https://git.openjdk.org/jdk/pull/31365#discussion_r3443362654