On Mon, 28 Oct 2024 13:53:58 GMT, Jorn Vernee <jver...@openjdk.org> wrote:
> There is a subtle race in `UpcallLinker::on_exit` between copying of the old > frame anchor back into place, and the GC. Since this copy is not atomic, it > may briefly appear as if a thread has no last Java frame, while still in the > `_thread_in_native` state, which leads to the GC skipping processing of any > active Java frames. > > This code was originally adapted from `JavaCallWrapper::!JavaCallWrapper` - > the JNI mechanism for upcalls - but in that case the frame anchor copy > happens in the `_thread_in_vm` state, which means the GC will wait for the > thread to get to a safepoint. > > The solution proposed here is to do the frame anchor copy in the java thread > state, before transitioning back to the native state. The java thread state, > like the vm thread state, is also 'safe' i.e. the GC will wait for the thread > to get to a safepoint, so we can safely do our non-atomic copy of the frame > anchor. > > Additionally, this PR resolves a similar issue in `on_entry`, by moving the > clearing of the pending exception (in case native code use a JNI API and > didn't handle the exception afterwards). We now also skip checking for async > exceptions when transitioning from native to java, so we don't immediately > clear them. Any async exceptions will be picked up at the next safepoint > instead. > > Special thanks to @stefank and @fisk for finding the root cause, and > @jaikiran for testing and debugging. > > Testing: tier 1-4, 20k runs of the failing test on linux-aarch64. > > I wonder if we can assert we are in a safepoint-safe state when doing so? > > I think we can do this. I've prototyped this here: > [pr/21742...JornVernee:jdk:SafeFrameAnchor+assert](https://github.com/openjdk/jdk/compare/pr/21742...JornVernee:jdk:SafeFrameAnchor+assert) > > This catches the issue fixed by this patch, and it passes at least tier 1. > We'd need something similar in assembly where we touch the frame anchor, is > `MacroAssembler::set_last_Java_frame` and > `MacroAssembler::reset_last_Java_frame`. Thinking some more about this: there might be other instances of `JavaFrameAnchor` that are fine to touch when the thread is in the native state. It's just the frame anchor inside a `JavaThread` that can not be touched if that thread is in a certain state. It might be possible to encapsulate the `JavaFrameAnchor` instance inside the thread, and then guard any accesses to it. But, that seems like a much more invasive change, so I'll hold off on that and focus this PR on fixing the issue. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21742#issuecomment-2489692928