On Thu, 16 Jan 2025 11:58:20 GMT, Per Minborg <pminb...@openjdk.org> wrote:
>> Going forward, converting older JDK code to use the relatively new FFM API
>> requires system calls that can provide `errno` and the like to explicitly
>> allocate a MemorySegment to capture potential error states. This can have
>> negative performance implications if not designed carefully and also
>> introduces unnecessary code complexity.
>>
>> Hence, this PR proposes to add a _JDK internal_ method handle adapter that
>> can be used to handle system calls with `errno`, `GetLastError`, and
>> `WSAGetLastError`.
>>
>> It currently relies on a thread-local cache of MemorySegments to elide
>> allocations. If, in the future, a more efficient thread-associated
>> allocation scheme becomes available, we could easily migrate to that one.
>>
>> Here are some benchmarks:
>>
>>     Benchmark                                        Mode  Cnt   Score   Error  Units
>>     CaptureStateUtilBench.explicitAllocationFail     avgt   30  41.615 ± 1.203  ns/op
>>     CaptureStateUtilBench.explicitAllocationSuccess  avgt   30  23.094 ± 0.580  ns/op
>>     CaptureStateUtilBench.threadLocalFail            avgt   30  14.760 ± 0.078  ns/op
>>     CaptureStateUtilBench.threadLocalReuseSuccess    avgt   30   7.189 ± 0.151  ns/op
>>
>> Explicit allocation:
>>
>>     try (var arena = Arena.ofConfined()) {
>>         return (int) HANDLE.invoke(arena.allocate(4), 0, 0);
>>     }
>>
>> Thread Local (tl):
>>
>>     return (int) ADAPTED_HANDLE.invoke(arena.allocate(4), 0, 0);
>>
>> The graph below shows the difference in latency for a successful call:
>>
>> 
>>
>> This is a ~3x improvement for both the happy and the error path.
>>
>> Tested and passed tiers 1-3.
>
> Per Minborg has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Remove unused class

As mentioned in https://mail.openjdk.org/pipermail/panama-dev/2025-January/020882.html, I've been looking at pretty much exactly the same thing for another buffer required in the call sequence, with slightly different takeaways:

* The implementation in your PR always leaves ownership of the segments inside the thread-local. Borrowing/releasing threads need to check "can I take this?" and "am I still on the right carrier, trying to return this to the right slot?". I find it easier to actually remove the elements from the cache slot, and potentially return them to the cache slot of a different CT. Ownership and responsibilities seem clearer to me this way.

* I've built on the assumption that the "CTL.get + test/modify" sequences are actually atomic with respect to virtual-thread scheduling (NIO code appears to rely on the same) and that no mounting/unmounting will happen inside them (but it _may_ happen within the acquire-invoke-release bracket; I have actually observed this between invoke and release). If rescheduling can happen inside these sequences and multiple threads contend for the CT-local element, I'm not 100% sure there is a point to using a CTL in the first place.

src/java.base/share/classes/jdk/internal/foreign/CaptureStateUtil.java line 333:

> 331:         SegmentPool cache = TL_POOLS.get();
> 332:         if (cache == null) {
> 333:             TL_POOLS.set(cache = new SegmentPool());

Why not use `initialValue`?
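For illustration, a minimal sketch of the `initialValue` variant being suggested. The `TL_POOLS` and `SegmentPool` names are taken from the quoted snippet; everything else is an assumption, not the PR's actual code:

    // Sketch only -- assumes TL_POOLS is a ThreadLocal-style holder (ThreadLocal or
    // a subclass such as jdk.internal.misc.CarrierThreadLocal) whose initialValue()
    // can be overridden.
    private static final ThreadLocal<SegmentPool> TL_POOLS = new ThreadLocal<>() {
        @Override
        protected SegmentPool initialValue() {
            return new SegmentPool();   // created lazily on each thread's first get()
        }
    };

    // The call site from the quoted snippet then collapses to a plain lookup,
    // with no null check and no explicit set():
    SegmentPool cache = TL_POOLS.get();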
src/java.base/share/classes/jdk/internal/util/SingleElementPool.java line 124:

> 122:     public void close() {
> 123:         if (UNSAFE.compareAndSetInt(this, CLOSED_OFFSET, FALSE, TRUE)) {
> 124:             if (UNSAFE.getIntVolatile(this, POOLED_ELEMENT_TAKEN_OFFSET) == FALSE) {

If we follow the premise that "any scheduling can happen", then this `close` could be racing with another thread trying to borrow the pooled element of this already-dead carrier thread:

    VT@CT1:  pool = tl.get()
             unmount
    CT1 dies => pool1.close
                taken == false => recycle
    VT@CT2:  mount
             pool.take() succeeds since `taken` is still false
             BOOM

I believe the close action would need to "take" the element before recycling it. Wouldn't this obviate the need for a separate "closed" slot?

Again, I'm not sure whether this scheduling is actually possible at this time.

src/java.base/share/classes/jdk/internal/util/SingleElementPool.java line 132:

> 130:     @ForceInline
> 131:     private boolean takePooledElement() {
> 132:         return UNSAFE.getAndSetInt(this, POOLED_ELEMENT_TAKEN_OFFSET, TRUE) == FALSE;

Does this need to use atomics? Looking at similar code in sun.nio.ch.Util.BufferCache, which seems to work without them. My interpretation is that BufferCache relies on a virtual thread not getting preempted inside the { CTL.get().takeElement } sequence? Would the same reasoning not work here? I.e.

    element = carrierThreadLocal.get()   [1]
    if (element.x)                       [2]
    element.y                            [3]

a) Can this virtual thread be moved to another carrier at [2] or [3]?
b) Can this virtual thread be suspended and another one scheduled between [1] and [2]/[3]?

My understanding is that in the current implementation of virtual threads, this does not happen and we can manage the cache element with straight non-atomic code (correct?). If not, how does BufferCache manage (pinned somewhere?)?

src/java.base/share/classes/jdk/internal/util/SingleElementPool.java line 137:

> 135:     @ForceInline
> 136:     private void releasePooledElement() {
> 137:         UNSAFE.putIntVolatile(this, POOLED_ELEMENT_TAKEN_OFFSET, FALSE);

Similar to the above: the use of volatile suggests that you're concerned about a virtual-thread switch. It would be good to have clarification on this.

-------------

PR Review: https://git.openjdk.org/jdk/pull/22391#pullrequestreview-2556247903
PR Review Comment: https://git.openjdk.org/jdk/pull/22391#discussion_r1918610681
PR Review Comment: https://git.openjdk.org/jdk/pull/22391#discussion_r1918706463
PR Review Comment: https://git.openjdk.org/jdk/pull/22391#discussion_r1918637436
PR Review Comment: https://git.openjdk.org/jdk/pull/22391#discussion_r1918648090
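P.S. For completeness, a minimal sketch of the "take before recycle" shape suggested for `close()` above. The field names mirror the quoted snippet; the recycle hook and the surrounding structure are assumptions, not the PR's actual code:

    // Sketch only, not the PR's code: close() claims the pooled element with the
    // same getAndSet that the borrow path uses, so it cannot race with a concurrent
    // take(), and no separate `closed` flag is needed to guard the recycling step.
    public void close() {
        if (UNSAFE.getAndSetInt(this, POOLED_ELEMENT_TAKEN_OFFSET, TRUE) == FALSE) {
            // We won the race: no other thread holds the element, so recycle it here.
            recycle(pooledElement);   // hypothetical recycle hook
        }
        // If the element was already taken, the borrower currently holds it; what the
        // release path should do in that case is a separate design question.
    }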