Re: RFR: 8338383: Implement JEP 491: Synchronize Virtual Threads without Pinning

Axel Boldt-Christmas Wed, 06 Nov 2024 09:41:47 -0800

On Thu, 17 Oct 2024 14:28:30 GMT, Patricio Chilano Mateo 
<[email protected]> wrote:


> This is the implementation of JEP 491: Synchronize Virtual Threads without 
> Pinning. See [JEP 491](https://bugs.openjdk.org/browse/JDK-8337395) for 
> further details.
> 
> In order to make the code review easier the changes have been split into the 
> following initial 4 commits:
> 
> - Changes to allow unmounting a virtual thread that is currently holding 
> monitors.
> - Changes to allow unmounting a virtual thread blocked on synchronized trying 
> to acquire the monitor.
> - Changes to allow unmounting a virtual thread blocked in `Object.wait()` and 
> its timed-wait variants.
> - Changes to tests, JFR pinned event, and other changes in the JDK libraries.
> 
> The changes fix pinning issues for all 4 ports that currently implement 
> continuations: x64, aarch64, riscv and ppc. Note: ppc changes were added 
> recently and stand in its own commit after the initial ones.
> 
> The changes fix pinning issues when using `LM_LIGHTWEIGHT`, i.e. the default 
> locking mode, (and `LM_MONITOR` which comes for free), but not when using 
> `LM_LEGACY` mode. Note that the `LockingMode` flag has already been 
> deprecated ([JDK-8334299](https://bugs.openjdk.org/browse/JDK-8334299)), with 
> the intention to remove `LM_LEGACY` code in future releases.
> 
> 
> ## Summary of changes
> 
> ### Unmount virtual thread while holding monitors
> 
> As stated in the JEP, currently when a virtual thread enters a synchronized 
> method or block, the JVM records the virtual thread's carrier platform thread 
> as holding the monitor, not the virtual thread itself. This prevents the 
> virtual thread from being unmounted from its carrier, as ownership 
> information would otherwise go wrong. In order to fix this limitation we will 
> do two things:
> 
> - We copy the oops stored in the LockStack of the carrier to the stackChunk 
> when freezing (and clear the LockStack). We copy the oops back to the 
> LockStack of the next carrier when thawing for the first time (and clear them 
> from the stackChunk). Note that we currently assume carriers don't hold 
> monitors while mounting virtual threads.
> 
> - For inflated monitors we now record the `java.lang.Thread.tid` of the owner 
> in the ObjectMonitor's `_owner` field instead of a JavaThread*. This allows 
> us to tie the owner of the monitor to a `java.lang.Thread` instance, rather 
> than to a JavaThread which is only created per platform thread. The tid is 
> already a 64 bit field so we can ignore issues of the counter wrapping around.
> 
> #### General notes about this part:
> 
> - Since virtual threads don't need to worry about holding monitors anymo...

I've done an initial look through of the hotspot changes.

In addition to my comments, I have looked at two more things.

One is to remove the _waiters reference counter from deflation and only use the 
_contentions reference counter. As well as tying the _contentions reference 
counter to the ObjectWaiter, so that it is easier to follow its lifetime, 
instead of these naked add_to_contentions, now that the ObjectWaiter does not 
have a straight forward scope, but can be frozen, and thawed on different 
threads. 46dacdf96999154e808d21e80b4d4e87f73bc802

Then I looked at typing up the thread / lock ids as an enum class 
34221f4a50a492cad4785cfcbb4bef8fa51d6f23

Either of these could be future RFEs.

Good work! I'll approve the GC related changes.

There are some simplifications I think can be done in the ObjectMonitor layer, 
but nothing that should go into this PR.

Similarly, (even if some of this is preexisting issues) I think that the way we 
describe the frames and the different frame transitions should be overhauled 
and made easier to understand. There are so many unnamed constants and 
adjustments which are spread out everywhere, which makes it hard to get an 
overview of exactly what happens and what interactions are related to what.  
You and Dean did a good job at simplifying and adding comments in this PR. But 
I hope this can be improved in the fututre.

A small note on `_cont_fastpath`, as it is now also used for synchronised 
native method calls (native wrapper) maybe the comment should be updated to 
reflect this.

// the sp of the oldest known interpreted/call_stub frame inside the
// continuation that we know about

src/hotspot/cpu/aarch64/c1_Runtime1_aarch64.cpp line 231:

> 229: 
> 230: StubFrame::~StubFrame() {
> 231:   __ epilogue(_use_pop_on_epilogue);

Can we not hook the `_use_pop_on_epilogue` into `return_state_t`, simplify the 
constructors and keep the old should_not_reach_here guard for stubs which 
should not return?
e.g.
```C++
enum return_state_t {
  does_not_return, requires_return, requires_pop_epilogue_return
};

StubFrame::~StubFrame() {
  if (_return_state == does_not_return) {
    __ should_not_reach_here();
  } else {
    __ epilogue(_return_state == requires_pop_epilogue_return);
  }
}

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 115:

> 113:   // The object's monitor m is unlocked iff m->owner == nullptr,
> 114:   // otherwise m->owner may contain a thread id, a stack address for 
> LM_LEGACY,
> 115:   // or the ANONYMOUS_OWNER constant for LM_LIGHTWEIGHT.

Comment seems out of place in `LockingMode != LM_LIGHTWEIGHT` code.

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 380:

> 378:     lea(t2_owner_addr, owner_address);
> 379: 
> 380:     // CAS owner (null => current thread id).

I think we should be more careful when and where we talk about thread id and 
lock id respectively. Given that `switchToCarrierThread` switches the thread, 
but not the lock id. We should probably define and talk about the lock id when 
it comes to locking, as saying thread id may be incorrect. 

Then there is also the different thread ids, the OS level one, and the java 
level one. (But not sure how to reconcile this without causing confusion)

src/hotspot/cpu/aarch64/continuationFreezeThaw_aarch64.inline.hpp line 300:

> 298:   CodeBlob* cb = top.cb();
> 299: 
> 300:   if (cb->frame_size() == 2) {

Is this a filter to identify c2 runtime stubs? Is there some other property we 
can check or assert here? This assumes that no other runtime frame will have 
this size.

src/hotspot/cpu/aarch64/continuationFreezeThaw_aarch64.inline.hpp line 313:

> 311: 
> 312:     log_develop_trace(continuations, preempt)("adjusted sp for c2 
> runtime stub, initial sp: " INTPTR_FORMAT " final sp: " INTPTR_FORMAT
> 313:                                               " fp: " INTPTR_FORMAT, 
> p2i(sp + frame::metadata_words), p2i(sp), sp[-2]);

Is there a reason for the mix of `2` and `frame::metadata_words`?

Maybe this could be
```C++
    intptr_t* const unadjusted_sp = sp;
    sp -= frame::metadata_words;
    sp[-2] = unadjusted_sp[-2];
    sp[-1] = unadjusted_sp[-1];

    log_develop_trace(continuations, preempt)("adjusted sp for c2 runtime stub, 
initial sp: " INTPTR_FORMAT " final sp: " INTPTR_FORMAT
                                              " fp: " INTPTR_FORMAT, 
p2i(unadjusted_sp), p2i(sp), sp[-2]);

src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1275:

> 1273: void SharedRuntime::continuation_enter_cleanup(MacroAssembler* masm) {
> 1274:   ::continuation_enter_cleanup(masm);
> 1275: }

Now that `continuation_enter_cleanup` is a static member function, just merge 
the static free function with this static member function.

src/hotspot/cpu/x86/assembler_x86.cpp line 2866:

> 2864:     emit_int32(0);
> 2865:   }
> 2866: }

Is it possible to make this more general and explicit instead of a sequence of 
bytes?

Something along the lines of:
```C++
  const address tar = L.is_bound() ? target(L) : pc();
  const Address adr = Address(checked_cast<int32_t>(tar - pc()), tar, 
relocInfo::none);

  InstructionMark im(this);
  emit_prefix_and_int8(get_prefixq(adr, dst), (unsigned char)0x8D);
  if (!L.is_bound()) {
    // Patch @0x8D opcode
    L.add_patch_at(code(), CodeBuffer::locator(offset() - 1, sect()));
  }
  // Register and [rip+disp] operand
  emit_modrm(0b00, raw_encode(dst), 0b101);
  // Adjust displacement by sizeof lea instruction
  int32_t disp = adr.disp() - checked_cast<int32_t>(pc() - inst_mark() + 
sizeof(int32_t));
  assert(is_simm32(disp), "must be 32bit offset [rip+offset]");
  emit_int32(disp);


and then in `pd_patch_instruction` simply match `op == 0x8D /* lea */`.

src/hotspot/share/oops/stackChunkOop.cpp line 445:

> 443: 
> 444: void stackChunkOopDesc::transfer_lockstack(oop* dst) {
> 445:   const bool requires_gc_barriers = is_gc_mode() || requires_barriers();

Given how careful we are in `Thaw` to not call `requires_barriers()` twice and 
use `_barriers` instead it would probably be nicer to pass in `_barriers` as a 
bool. 

There is only one other place we do the extra call and it is in 
`fix_thawed_frame`, but that only happens after we are committed to the slow 
path, so it might be nice for completeness, but should be negligible for 
performance. Here however we might still be in our new "medium" path where we 
could still do a fast thaw.

src/hotspot/share/oops/stackChunkOop.cpp line 460:

> 458:     } else {
> 459:       oop value = *reinterpret_cast<oop*>(at);
> 460:       HeapAccess<>::oop_store(reinterpret_cast<oop*>(at), nullptr);

Using HeapAccess when `!requires_gc_barriers` is wrong. This would crash with 
ZGC when/if we fix the flags race and changed `relativize_chunk_concurrently` 
to only be conditioned `requires_barriers() / _barriers` (and allowing the 
retry_fast_path "medium" path). 
So either use `*reinterpret_cast<oop*>(at) = nullptr;` or do what my initial 
suggestion with `clear_lockstack` did, just omit the clearing. Before we 
requires_barriers(), we are allowed to reuse the stackChuncks, so trying to 
clean them up seems fruitless.

src/hotspot/share/oops/stackChunkOop.cpp line 471:

> 469:     }
> 470:   }
> 471: }

Can we turn these three very similar loops into one? In my opinion, it is 
easier to parse.

```C++
void stackChunkOopDesc::copy_lockstack(oop* dst) {
  const int cnt = lockstack_size();
  const bool requires_gc_barriers = is_gc_mode() || requires_barriers();
  const bool requires_uncompress = requires_gc_barriers && has_bitmap() && 
UseCompressedOops;
  const auto get_obj = [&](intptr_t* at) -> oop {
    if (requires_gc_barriers) {
      if (requires_uncompress) {
        return HeapAccess<>::oop_load(reinterpret_cast<narrowOop*>(at));
      }
      return HeapAccess<>::oop_load(reinterpret_cast<oop*>(at));
    }
    return *reinterpret_cast<oop*>(at);
  };

  intptr_t* lockstack_start = start_address();
  for (int i = 0; i < cnt; i++) {
    oop mon_owner = get_obj(&lockstack_start[i]);
    assert(oopDesc::is_oop(mon_owner), "not an oop");
    dst[i] = mon_owner;
  }
}

src/hotspot/share/prims/jvmtiExport.cpp line 1681:

> 1679:   EVT_TRIG_TRACE(EXT_EVENT_VIRTUAL_THREAD_UNMOUNT, ("[%p] Trg Virtual 
> Thread Unmount event triggered", vthread));
> 1680: 
> 1681:   // On preemption JVMTI state rebinding has already happened so get it 
> always direclty from the oop.

Suggestion:

  // On preemption JVMTI state rebinding has already happened so get it always 
directly from the oop.

src/hotspot/share/runtime/continuationFreezeThaw.cpp line 2234:

> 2232:     retry_fast_path = true;
> 2233:   } else {
> 2234:     relativize_chunk_concurrently(chunk);

Is the `relativize_chunk_concurrently` solution to the race only to have a 
single flag read in `can_thaw_fast` or is there some other subtlety here? 

While not required for the PR, if it is just to optimise the `can_thaw_fast` 
check, it can probably be made to work with one load and still allow concurrent 
gcs do fast_thaw when we only get here due to a lockstack.

src/hotspot/share/runtime/continuationFreezeThaw.cpp line 2247:

> 2245:     _thread->lock_stack().move_from_address(tmp_lockstack, 
> lockStackSize);
> 2246: 
> 2247:     chunk->set_lockstack_size(0);

After some discussion here at the office we think there might be an issue here 
with simply hiding the oops without clearing them. Below in `recurse_thaw` we 
`do_barriers`. But it does not touch these lockstack. Missing the SATB store 
barrier is probably fine from a liveness perspective, because the oops in the 
lockstack must also be in the frames. But removing the oops without a barrier 
and clear will probably lead to problems down the line.

Something like the following would probably handle this. Or even fuse the 
`copy_lockstack` and `clear_lockstack` together into some kind of 
`transfer_lockstack` which both loads and clears the oops.


diff --git a/src/hotspot/share/oops/stackChunkOop.cpp 
b/src/hotspot/share/oops/stackChunkOop.cpp
index d3d63533eed..f737bd2db71 100644
--- a/src/hotspot/share/oops/stackChunkOop.cpp
+++ b/src/hotspot/share/oops/stackChunkOop.cpp
@@ -470,6 +470,28 @@ void stackChunkOopDesc::copy_lockstack(oop* dst) {
   }
 }
 
+void stackChunkOopDesc::clear_lockstack() {
+  const int cnt = lockstack_size();
+  const bool requires_gc_barriers = is_gc_mode() || requires_barriers();
+  const bool requires_uncompress = has_bitmap() && UseCompressedOops;
+  const auto clear_obj = [&](intptr_t* at) {
+    if (requires_uncompress) {
+      HeapAccess<>::oop_store(reinterpret_cast<narrowOop*>(at), nullptr);
+    } else {
+      HeapAccess<>::oop_store(reinterpret_cast<oop*>(at), nullptr);
+    }
+  };
+
+  if (requires_gc_barriers) {
+    intptr_t* lockstack_start = start_address();
+    for (int i = 0; i < cnt; i++) {
+      clear_obj(&lockstack_start[i]);
+    }
+  }
+  set_lockstack_size(0);
+  set_has_lockstack(false);
+}
+
 void stackChunkOopDesc::print_on(bool verbose, outputStream* st) const {
   if (*((juint*)this) == badHeapWordVal) {
     st->print_cr("BAD WORD");
diff --git a/src/hotspot/share/oops/stackChunkOop.hpp 
b/src/hotspot/share/oops/stackChunkOop.hpp
index 28e0576801e..928e94dd695 100644
--- a/src/hotspot/share/oops/stackChunkOop.hpp
+++ b/src/hotspot/share/oops/stackChunkOop.hpp
@@ -167,6 +167,7 @@ class stackChunkOopDesc : public instanceOopDesc {
   void fix_thawed_frame(const frame& f, const RegisterMapT* map);
 
   void copy_lockstack(oop* start);
+  void clear_lockstack();
 
   template <typename OopT, class StackChunkLockStackClosureType>
   inline void iterate_lockstack(StackChunkLockStackClosureType* closure);
diff --git a/src/hotspot/share/runtime/continuationFreezeThaw.cpp 
b/src/hotspot/share/runtime/continuationFreezeThaw.cpp
index 5b6e48a02f3..e7d505bb9b1 100644
--- a/src/hotspot/share/runtime/continuationFreezeThaw.cpp
+++ b/src/hotspot/share/runtime/continuationFreezeThaw.cpp
@@ -2244,8 +2244,7 @@ NOINLINE intptr_t* Thaw<ConfigT>::thaw_slow(stackChunkOop 
chunk, Continuation::t
     chunk->copy_lockstack(tmp_lockstack);
     _thread->lock_stack().move_from_address(tmp_lockstack, lockStackSize);
 
-    chunk->set_lockstack_size(0);
-    chunk->set_has_lockstack(false);
+    chunk->clear_lockstack();
     retry_fast_path = true;
   }
 ```

src/hotspot/share/runtime/continuationFreezeThaw.cpp line 2538:

> 2536:   Method* m = hf.interpreter_frame_method();
> 2537:   // For native frames we need to count parameters, possible alignment, 
> plus the 2 extra words (temp oop/result handler).
> 2538:   const int locals = !m->is_native() ? m->max_locals() : 
> m->size_of_parameters() + frame::align_wiggle + 2;

Is it possible to have these extra native frame slots size be a named constant 
/ enum value on `frame`? I think it is used in a couple of places.

src/hotspot/share/runtime/frame.cpp line 535:

> 533:   assert(get_register_address_in_stub(f, 
> SharedRuntime::thread_register()) == (address)thread_addr, "wrong thread 
> address");
> 534:   return thread_addr;
> 535: #endif

With this ifdef, it seems like this belongs in the platform dependent part of 
the frame class.

src/hotspot/share/runtime/javaThread.cpp line 1545:

> 1543:     if (is_vthread_mounted()) {
> 1544:       // _lock_id is the thread ID of the mounted virtual thread
> 1545:       st->print_cr("   Carrying virtual thread #" INT64_FORMAT, 
> lock_id());

What is the interaction here with `switchToCarrierThread` and the window 
between?

        carrier.setCurrentThread(carrier);
        Thread.setCurrentLockId(this.threadId());

Will we print the carrier threads id as a virtual threads id? (I am guessing 
that is_vthread_mounted is true when switchToCarrierThread is called).

src/hotspot/share/runtime/objectMonitor.hpp line 184:

> 182:   // - We test for anonymous owner by testing for the lowest bit, 
> therefore
> 183:   //   DEFLATER_MARKER must *not* have that bit set.
> 184:   static const int64_t DEFLATER_MARKER = 2;

The comments here should be updated / removed. They are talking about the lower 
bits of the owner being unset which is no longer true. (And talks about doing 
bit tests, which I do not think is done anywhere even without this patch).

src/hotspot/share/runtime/objectMonitor.hpp line 186:

> 184:   static const int64_t DEFLATER_MARKER = 2;
> 185: 
> 186:   int64_t volatile _owner;  // Either tid of owner, 
> ANONYMOUS_OWNER_MARKER or DEFLATER_MARKER.

Suggestion:

  int64_t volatile _owner;  // Either tid of owner, NO_OWNER, ANONYMOUS_OWNER 
or DEFLATER_MARKER.

src/hotspot/share/runtime/objectMonitor.inline.hpp line 50:

> 48: inline int64_t ObjectMonitor::owner_from(oop vthread) {
> 49:   int64_t tid = java_lang_Thread::thread_id(vthread);
> 50:   assert(tid >= 3 && tid < ThreadIdentifier::current(), "must be 
> reasonable");

Suggestion:

  assert(tid >= ThreadIdentifier::initial() && tid < 
ThreadIdentifier::current(), "must be reasonable");

src/hotspot/share/runtime/synchronizer.cpp line 1467:

> 1465:       markWord dmw = inf->header();
> 1466:       assert(dmw.is_neutral(), "invariant: header=" INTPTR_FORMAT, 
> dmw.value());
> 1467:       if (inf->is_owner_anonymous() && inflating_thread != nullptr) {

Are these `LM_LEGACY` + `ANONYMOUS_OWNER` changes still required now that 
`LM_LEGACY` does no freeze?

src/java.base/share/classes/jdk/internal/vm/Continuation.java line 62:

> 60:         NATIVE(2, "Native frame or <clinit> on stack"),
> 61:         MONITOR(3, "Monitor held"),
> 62:         CRITICAL_SECTION(4, "In critical section");

Is there a reason that the `reasonCode` values does not match the 
`freeze_result` reason values used in `pinnedReason(int reason)` to create one 
of these? 

I cannot see that it is used either. Only seem to be read for JFR 
VirtualThreadPinned Event which only uses the string.

-------------

PR Review: https://git.openjdk.org/jdk/pull/21565#pullrequestreview-2381051930
Marked as reviewed by aboldtch (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/21565#pullrequestreview-2417363171
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808181783
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808189977
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808208652
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808282892
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808261926
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808318304
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808358874
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1825949756
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1825942254
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808706427
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808809374
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1810772765
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1810764911
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808460330
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809032469
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809065834
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809091338
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809092367
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1829464866
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809111830
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1827276764

Re: RFR: 8338383: Implement JEP 491: Synchronize Virtual Threads without Pinning

Reply via email to