On Thu, 30 Apr 2026 02:35:35 GMT, David Holmes <[email protected]> wrote:

>> Please review this change to fix a problem that can rise if JVM TI 
>> suspension is allowed when a thread is executing in a JNI "critical" region. 
>> The gory details are in the first comment so that the PR emails are shorter
>> 
>> A new test is introduced to check that we cannot suspend in a critical region
>> 
>> Other testing:
>> - Tiers 1-5 on all platforms
>> 
>> The key insights into this solution are attributed to @pchilano. Everything 
>> simpler I tried was buggy and led me back to Patricio's suggested changes to 
>> the operation filtering. The actual details of this and any remaining bugs 
>> in it are all my own.
>> 
>> Thanks.
>> 
>> ---------
>> - [x] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> David Holmes has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fix typo

Changes look good. Only found one issue I described below.

src/hotspot/share/runtime/handshake.cpp line 718:

> 716: 
> 717:   HandshakeOperation* op = get_op();
> 718:   assert(op != nullptr, "Must have an op");

There is a subtle race now where we can hit this assert. The sequence of events 
go as follows:
1 - `Thread1` adds handshake operation to `TargetThread's` queue. Sees 
`TargetThread's` as handshake safe and proceeds to `claim_handshake()`.
2 - Before `Thread1` calls `claim_handshake`, `TargetThread’s` processes the 
operation and removes it from the queue.
3 - `TargetThread’s` calls `GetPrimitiveArrayCritical`. Before incrementing 
`_jni_deferred_suspension_count`, `Thread2` adds suspend handshake to 
`TargetThread’s` queue.
4 - `Thread1` calls `claim_handshake` and sees that there is a pending 
operation that is enabled.
5 - `TargetThread’s` increments `_jni_deferred_suspension_count` and continues 
execution in critical native code.
6 - `Thread1` calls `can_process_handshake()` and sees `TargetThread` in 
native. Calls `get_op` but the only operation in the queue is not enabled.

I created a test with a couple of artificial delays in the VM which triggers 
this assert: 
https://github.com/openjdk/jdk/compare/pr/30936...pchilano:jdk:8373839-jni-crit-suspend-stresstest

One way of fixing this could be to not allow others threads besides the 
requester/target to process the suspend handshake. That means 
`HandshakeClosure::is_enabled` would need to also take `_requester`.
I’m also thinking if we could move `can_process_handshake` to `claim_handshake` 
after grabbing the monitor and before checking the queue, i.e. we first check 
if the thread is in a handshake safe state.

-------------

PR Review: https://git.openjdk.org/jdk/pull/30936#pullrequestreview-4222819289
PR Review Comment: https://git.openjdk.org/jdk/pull/30936#discussion_r3183928461

Reply via email to