On Mon, 28 Oct 2024 13:53:58 GMT, Jorn Vernee <jver...@openjdk.org> wrote:
> There is a subtle race in `UpcallLinker::on_exit` between copying of the old > frame anchor back into place, and the GC. Since this copy is not atomic, it > may briefly appear as if a thread has no last Java frame, while still in the > `_thread_in_native` state, which leads to the GC skipping processing of any > active Java frames. > > This code was originally adapted from `JavaCallWrapper::!JavaCallWrapper` - > the JNI mechanism for upcalls - but in that case the frame anchor copy > happens in the `_thread_in_vm` state, which means the GC will wait for the > thread to get to a safepoint. > > The solution proposed here is to do the frame anchor copy in the java thread > state, before transitioning back to the native state. The java thread state, > like the vm thread state, is also 'safe' i.e. the GC will wait for the thread > to get to a safepoint, so we can safely do our non-atomic copy of the frame > anchor. > > Additionally, this PR resolves a similar issue in `on_entry`, by moving the > clearing of the pending exception (in case native code use a JNI API and > didn't handle the exception afterwards). We now also skip checking for async > exceptions when transitioning from native to java, so we don't immediately > clear them. Any async exceptions will be picked up at the next safepoint > instead. > > Special thanks to @stefank and @fisk for finding the root cause, and > @jaikiran for testing and debugging. > > Testing: tier 1-4, 20k runs of the failing test on linux-aarch64. lgtm. Would be nice if if we could assert that we are not in native or blocked when touching the oops as well. Similarly to modifications of the frame anchor. But I agree that it should be done separately. ------------- Marked as reviewed by aboldtch (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/21742#pullrequestreview-2464545413