On Thu, 16 Jan 2025 11:58:20 GMT, Per Minborg <[email protected]> wrote:
>> Going forward, converting older JDK code to the relatively new FFM API means
>> that call sites of system calls that report errors via `errno` and the like
>> have to explicitly allocate a MemorySegment to capture the potential error
>> state. If not designed carefully, this can hurt performance, and it also
>> introduces unnecessary code complexity.
>>
>> Hence, this PR proposes to add a _JDK internal_ method handle adapter that
>> can be used to handle system calls with `errno`, `GetLastError`, and
>> `WSAGetLastError`.
>>
>> It currently relies on a thread-local cache of MemorySegments to elide
>> allocations. If, in the future, a more efficient thread-associated
>> allocation scheme becomes available, we could easily migrate to that one.
>>
>> Here are some benchmarks:
>>
>>
>> Benchmark                                         Mode  Cnt   Score   Error  Units
>> CaptureStateUtilBench.explicitAllocationFail      avgt   30  41.615 ± 1.203  ns/op
>> CaptureStateUtilBench.explicitAllocationSuccess   avgt   30  23.094 ± 0.580  ns/op
>> CaptureStateUtilBench.threadLocalFail             avgt   30  14.760 ± 0.078  ns/op
>> CaptureStateUtilBench.threadLocalReuseSuccess     avgt   30   7.189 ± 0.151  ns/op
>>
>>
>> Explicit allocation:
>>
>> try (var arena = Arena.ofConfined()) {
>>     return (int) HANDLE.invoke(arena.allocate(4), 0, 0);
>> }
>>
>>
>> Thread Local (tl):
>>
>> return (int) ADAPTED_HANDLE.invoke(0, 0);
>>
>>
>> The graph below shows the difference in latency for a successful call:
>>
>> [graph: call latency, explicit allocation vs. thread-local reuse]
>>
>> This is a ~3x improvement for both the happy and the error path.
>>
>>
>> Tested and passed tiers 1-3.
>
> Per Minborg has updated the pull request incrementally with one additional
> commit since the last revision:
>
> Remove unused class
As mentioned in
https://mail.openjdk.org/pipermail/panama-dev/2025-January/020882.html, I've
been looking at pretty much exactly the same thing for another buffer required
in the call sequence, with slightly different takeaways:
* the implementation in your PR always leaves ownership of the segments inside
the thread-local. Borrowing/releasing threads need to check "can I take this?",
"am I still on the right carrier, returning this to the right slot?". I find it
easier to actually remove the element from the cache slot and potentially return
it to the cache slot of a different CT; ownership and responsibilities seem
clearer to me that way (see the sketch after this list).
* I've built on the assumption that the "CTL.get.test/modify" sequences are
actually atomic wrt virtual-thread scheduling (NIO code seems to rely on the
same) and that no mounting/unmounting will happen inside them (it _may_ happen
within the acquire-invoke-release bracket; I have actually observed that between
invoke and release). If rescheduling can happen inside these sequences and
multiple threads contend for the CT-local element, I'm not 100% sure there is a
point to using CTL in the first place?
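To make the first point concrete, here is a minimal sketch of the take-out/put-back
scheme I have in mind. It is not the PR's code: the class name, the plain
ThreadLocal standing in for the carrier-thread-local, and the fixed 4-byte capture
size are all made up for illustration.

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;

    // Hypothetical sketch, not the PR's code: the pooled segment is removed from
    // the slot on take() and handed back to whatever slot the releasing thread
    // currently sees, so exactly one party owns it at any time.
    final class TakeOutPool {

        // Stand-in for the carrier-thread-local; a plain ThreadLocal is used here
        // only to keep the sketch compilable outside java.base.
        private static final ThreadLocal<MemorySegment> SLOT = new ThreadLocal<>();

        // Take the cached segment, or allocate a fresh one if the slot is empty.
        static MemorySegment take() {
            MemorySegment segment = SLOT.get();
            if (segment != null) {
                SLOT.set(null);                    // ownership leaves the slot
                return segment;
            }
            return Arena.global().allocate(4);     // assumed capture-state size
        }

        // Return the segment to the slot of the *current* carrier, which may be a
        // different one than it was taken from; drop it if that slot is occupied.
        static void release(MemorySegment segment) {
            if (SLOT.get() == null) {
                SLOT.set(segment);
            }
        }
    }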
src/java.base/share/classes/jdk/internal/foreign/CaptureStateUtil.java line 333:
> 331:         SegmentPool cache = TL_POOLS.get();
> 332:         if (cache == null) {
> 333:             TL_POOLS.set(cache = new SegmentPool());
Why not use `initialValue`?
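Something along these lines (a sketch only; it assumes the carrier-thread-local
behaves like a plain `ThreadLocal` with respect to `initialValue`, and it reuses
the PR's `SegmentPool`):

    // Hypothetical alternative to the null check above: let the thread-local
    // create the pool on first access.
    private static final ThreadLocal<SegmentPool> TL_POOLS =
            new ThreadLocal<>() {
                @Override
                protected SegmentPool initialValue() {
                    return new SegmentPool();
                }
            };

    // ...after which the lookup can never observe null:
    SegmentPool cache = TL_POOLS.get();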
src/java.base/share/classes/jdk/internal/util/SingleElementPool.java line 124:
> 122:     public void close() {
> 123:         if (UNSAFE.compareAndSetInt(this, CLOSED_OFFSET, FALSE, TRUE)) {
> 124:             if (UNSAFE.getIntVolatile(this, POOLED_ELEMENT_TAKEN_OFFSET) == FALSE) {
If we follow the premise that "any scheduling can happen", then this `close`
could be racing with another thread trying to borrow the pooled element of this
already-dead carrier thread:
VT@CT1:
    pool = tl.get()
    unmount
CT1 dies =>
    pool1.close()
    taken == false => recycle
VT@CT2:
    mount
    pool.take() succeeds since `taken` still false
    BOOM
I believe the close action would need to "take" the element before recycling it.
Wouldn't this obviate the need for a separate "closed" slot?
Again, I'm not sure whether this scheduling is actually possible at this time.
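Roughly what I mean, as a sketch (names are made up, and a VarHandle stands in for
the Unsafe accesses in the PR):

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    // Hypothetical sketch: close() first "takes" the pooled element through the
    // same flag that take() uses, so a late borrower on another carrier can never
    // obtain an element that is about to be recycled.
    final class SketchPool {

        private volatile int pooledElementTaken;             // 0 = free, 1 = taken
        private final Object pooledElement = new Object();   // stand-in for the segment

        private static final VarHandle TAKEN;
        static {
            try {
                TAKEN = MethodHandles.lookup()
                        .findVarHandle(SketchPool.class, "pooledElementTaken", int.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        void close() {
            // Winning this CAS means no concurrent take() can also win it, so the
            // element can be recycled safely; losing means someone else owns it.
            if (TAKEN.compareAndSet(this, 0, 1)) {
                recycle(pooledElement);
            }
        }

        private void recycle(Object element) {
            // hand the element back to some global pool / free it
        }
    }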
src/java.base/share/classes/jdk/internal/util/SingleElementPool.java line 132:
> 130:     @ForceInline
> 131:     private boolean takePooledElement() {
> 132:         return UNSAFE.getAndSetInt(this, POOLED_ELEMENT_TAKEN_OFFSET, TRUE) == FALSE;
Does this need to use atomics? Similar code in sun.nio.ch.Util.BufferCache seems
to work without them. My interpretation is that BufferCache relies on a virtual
thread not being preempted inside the { CTL.get().takeElement } sequence? Would
the same reasoning not work here?
I.e.

    element = carrierThreadLocal.get()   [1]
    if (element.x)                       [2]
        element.y                        [3]

a) can this virtual thread be moved to another carrier at [2] or [3]?
b) can this virtual thread be suspended and another one scheduled between [1] and [2]/[3]?
My understanding is that in the current implementation of virtual threads this
does not happen, so we can manage the cache element with straight non-atomic
code; is that correct? If not, how does BufferCache cope (is it pinned
somewhere)?
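In code, the non-atomic variant I have in mind would look roughly like this
(made-up names; correct only under the no-preemption assumption above):

    // Hypothetical BufferCache-style variant: no volatile, no CAS. Correct only if
    // the { CTL.get(); take/release } sequence runs on one carrier without another
    // virtual thread being scheduled in between.
    final class PlainPool {

        private boolean pooledElementTaken;                  // plain field, no atomics
        private final Object pooledElement = new Object();   // stand-in for the segment

        Object tryTake() {
            if (!pooledElementTaken) {     // plain read, cf. [2]
                pooledElementTaken = true;
                return pooledElement;      // plain use, cf. [3]
            }
            return null;                   // caller falls back to a fresh allocation
        }

        void release() {
            pooledElementTaken = false;    // plain write
        }
    }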
src/java.base/share/classes/jdk/internal/util/SingleElementPool.java line 137:
> 135:     @ForceInline
> 136:     private void releasePooledElement() {
> 137:         UNSAFE.putIntVolatile(this, POOLED_ELEMENT_TAKEN_OFFSET, FALSE);
Similar to the above: the use of a volatile write suggests that you are concerned
about a virtual-thread switch here as well. It would be good to have clarification
on this.
-------------
PR Review: https://git.openjdk.org/jdk/pull/22391#pullrequestreview-2556247903
PR Review Comment: https://git.openjdk.org/jdk/pull/22391#discussion_r1918610681
PR Review Comment: https://git.openjdk.org/jdk/pull/22391#discussion_r1918706463
PR Review Comment: https://git.openjdk.org/jdk/pull/22391#discussion_r1918637436
PR Review Comment: https://git.openjdk.org/jdk/pull/22391#discussion_r1918648090