On Wed, 10 Jun 2026 17:35:43 GMT, Mat Carter <[email protected]> wrote:
> With -UseLSE the Starvation test on Windows ARM64 timeouts close to 100%
> whereas this only happens on Linux ARM64 on larger machines with many cores.
> The issue is that C2 outputs an LDR following the CAS in LinkedTransferQueue
> which can execute before the STLXR breaking the Dekker protocol.
>
> Replacing the LDR with LADR by using getAcquire solves the issue as it won't
> be reordered before the STLXR. This does impact the +UseLSE case as the LADR
> was not necessary and is slightly more expensive than LDR. But to handle
> this case would require larger changes to Hotspot
>
> Starvation test passes on Windows ARM64 and Linux ARM64, with no regressions
> on tier1
>
> ---------
> - [x] I confirm that I make this contribution in accordance with the [OpenJDK
> Interim AI Policy](https://openjdk.org/legal/ai).

OK, so we have the following `cmpxchg`:

```c++
// UseLSE
mov result, expected
casal result, new_val, [addr]
cmp result, expected

// !UseLSE
prfm pstl1strm, [addr]
retry:
ldaxr result, [addr]
cmp result, expected
b.ne done
stlxr rscratch1, new_val, [addr]
cbnz rscratch1, retry
done:
```

Well, in the `!UseLSE` case a `ldr` after `done:` may move up, `stlxr` doesn't prevent that. OK, sure, so the load may float. As far as I know, the same is true for `casal`, in that its store effect is not necessarily performed before the `ldr`.

Alright, so what's up with park and unpark then?

    DualNode::park:
      this.waiter = Thread::currentThread();
      FullFence();
      yield;

    DualNode::unpark:
      bool success = CAS(p, m, e) // src, expected, new_value
      if (!success) {
        Thread w = p.waiter;
      }

I assume that the failed CAS signals that there is a waiter, but since the read of the `p.waiter` isn't ordered with the CAS, it may read null and the waiting thread is never unparked.

Yeah, so this code really needs the load-acquire, as this is a Dekker-style thingie. I'm pretty sure that this is broken in the `UseLSE` case as well, it's just less likely to observe the borked behavior.

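For reference, the park/unpark interleaving above has the shape of a store-buffering litmus test: each side writes one location and then reads the other, and the handshake only works if at least one side observes the other's write. Here is a minimal C++ sketch of that shape. It is not the Java `LinkedTransferQueue` code and not what C2 emits; `Slot`, `publish_then_check`, `cas_then_read_waiter` and `count_lost_wakeups` are made-up names, and the `seq_cst` load stands in for the load-acquire discussed here (on AArch64 compilers typically emit `ldar` for it).

```cpp
#include <atomic>
#include <thread>

// Hypothetical model of one handshake slot; all names are illustrative only.
struct Slot {
    std::atomic<int> item{0};    // 0 = unmatched, 1 = matched by the CAS side
    std::atomic<int> waiter{0};  // 0 = no waiter published, 1 = waiter published
};

// Park side: publish the waiter, issue a full fence, then re-check the item.
// Returns the item value observed after publishing.
int publish_then_check(Slot& s) {
    s.waiter.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return s.item.load(std::memory_order_relaxed);
}

// Unpark side: CAS the item, then read the waiter with an ordered load.
// Returns the waiter value observed after the CAS.
int cas_then_read_waiter(Slot& s) {
    int expected = 0;
    s.item.compare_exchange_strong(expected, 1, std::memory_order_seq_cst);
    // If this load were allowed to float above the store half of the CAS,
    // both sides could read 0 and the waiter would never be woken.
    return s.waiter.load(std::memory_order_seq_cst);
}

// Runs the two sides concurrently `rounds` times and counts the rounds in
// which neither side saw the other's write (a "lost wakeup").
int count_lost_wakeups(int rounds) {
    int lost = 0;
    for (int i = 0; i < rounds; ++i) {
        Slot s;
        int seen_item = -1;
        int seen_waiter = -1;
        std::thread parker([&] { seen_item = publish_then_check(s); });
        std::thread unparker([&] { seen_waiter = cas_then_read_waiter(s); });
        parker.join();
        unparker.join();
        if (seen_item == 0 && seen_waiter == 0) {
            ++lost;
        }
    }
    return lost;
}
```

With these orderings the C++ memory model forbids the outcome where both sides read 0, so `count_lost_wakeups(n)` should return 0 for any `n`. Weakening the final load to `memory_order_relaxed` and the CAS to `memory_order_acq_rel` removes that guarantee, which is the analogue of the plain `ldr` floating above the `stlxr`.
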
The `cmpxchg` code is generated by:

```c++
void MacroAssembler::cmpxchg(Register addr, Register expected,
                             Register new_val,
                             enum operand_size size,
                             bool acquire, bool release,
                             bool weak,
                             Register result) {
  if (result == noreg)  result = rscratch1;
  BLOCK_COMMENT("cmpxchg {");
  if (UseLSE) {
    mov(result, expected);
    lse_cas(result, new_val, addr, size, acquire, release, /*not_pair*/ true);
    compare_eq(result, expected, size);
#ifdef ASSERT
    // Poison rscratch1 which is written on !UseLSE branch
    mov(rscratch1, 0x1f1f1f1f1f1f1f1f);
#endif
  } else {
    Label retry_load, done;
    prfm(Address(addr), PSTL1STRM);
    bind(retry_load);
    load_exclusive(result, addr, size, acquire);
    compare_eq(result, expected, size);
    br(Assembler::NE, done);
    store_exclusive(rscratch1, new_val, addr, size, release);
    if (weak) {
      cmpw(rscratch1, 0u);  // If the store fails, return NE to our caller.
    } else {
      cbnzw(rscratch1, retry_load);
    }
    bind(done);
  }
  BLOCK_COMMENT("} cmpxchg");
}
```

-------------

PR Comment: https://git.openjdk.org/jdk/pull/31465#issuecomment-4710767471
