On Tue, 13 Aug 2024 16:03:16 GMT, Roman Kennke <rken...@openjdk.org> wrote:
>> I tried the following (see diff below) and it shows about a 5-10% regression >> in most the `LockUnlock.testInflated*` micros. Also tried with just >> `num_unrolled = 1` saw the same regression. Maybe there was some other >> pattern you were thinking of. There are probably architecture and platform >> differences. This can and should probably be explored in a followup PR. >> >> >> >> diff --git a/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp >> b/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp >> index 5dbfdbc225d..4e6621cfece 100644 >> --- a/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp >> +++ b/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp >> @@ -663,25 +663,28 @@ void C2_MacroAssembler::fast_lock_lightweight(Register >> obj, Register box, Regist >> >> const int num_unrolled = 2; >> for (int i = 0; i < num_unrolled; i++) { >> - cmpptr(obj, Address(t)); >> - jccb(Assembler::equal, monitor_found); >> - increment(t, in_bytes(OMCache::oop_to_oop_difference())); >> + Label next; >> + cmpptr(obj, Address(t, OMCache::oop_to_oop_difference() * i)); >> + jccb(Assembler::notEqual, next); >> + increment(t, in_bytes(OMCache::oop_to_oop_difference() * i)); >> + jmpb(monitor_found); >> + bind(next); >> } >> + increment(t, in_bytes(OMCache::oop_to_oop_difference() * >> (num_unrolled - 1))); >> >> Label loop; >> >> // Search for obj in cache. >> bind(loop); >> - >> - // Check for match. >> - cmpptr(obj, Address(t)); >> - jccb(Assembler::equal, monitor_found); >> - >> + // Advance. >> + increment(t, in_bytes(OMCache::oop_to_oop_difference())); >> // Search until null encountered, guaranteed _null_sentinel at end. >> cmpptr(Address(t), 1); >> jcc(Assembler::below, slow_path); // 0 check, but with ZF=0 when *t >> == 0 >> - increment(t, in_bytes(OMCache::oop_to_oop_difference())); >> - jmpb(loop); >> + >> + // Check for match. >> + cmpptr(obj, Address(t)); >> + jccb(Assembler::notEqual, loop); >> >> // Cache hit. >> bind(monitor_found); > > Yeah it's probably not very important. But it's not quite what I had in mind, > I was thinking more something like (aarch64 version, untested, may be wrong): > > > diff --git a/src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp > b/src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp > index 19af03d3488..05bbb5760b8 100644 > --- a/src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp > +++ b/src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp > @@ -302,14 +302,14 @@ void C2_MacroAssembler::fast_lock_lightweight(Register > obj, Register box, Regist > Label monitor_found; > > // Load cache address > - lea(t3_t, Address(rthread, JavaThread::om_cache_oops_offset())); > + lea(t3_t, Address(rthread, JavaThread::om_cache_oops_offset() - > OMCache::oop_to_oop_difference())); > > const int num_unrolled = 2; > for (int i = 0; i < num_unrolled; i++) { > + increment(t3_t, in_bytes(OMCache::oop_to_oop_difference())); > ldr(t1, Address(t3_t)); > cmp(obj, t1); > br(Assembler::EQ, monitor_found); > - increment(t3_t, in_bytes(OMCache::oop_to_oop_difference())); > } > > Label loop; > @@ -317,16 +317,14 @@ void C2_MacroAssembler::fast_lock_lightweight(Register > obj, Register box, Regist > // Search for obj in cache. > bind(loop); > > - // Check for match. > - ldr(t1, Address(t3_t)); > - cmp(obj, t1); > - br(Assembler::EQ, monitor_found); > + increment(t3_t, in_bytes(OMCache::oop_to_oop_difference())); > > + ldr(t1, Address(t3_t)); > // Search until null encountered, guaranteed _null_sentinel at end. > - increment(t3_t, in_bytes(OMCache::oop_to_oop_difference())); > - cbnz(t1, loop); > - // Cache Miss, NE set from cmp above, cbnz does not set flags > - b(slow_path); > + cbz(t1, slow_path); > + // Check for match. > + cmp(obj, t1); > + br(Assembler::NE, loop); > > bind(monitor_found); I see just the loop. I read it as the a cachehit should not take conditional branches. The `num_unrolled = 1` effectively became what you suggest, and showed similar regressions (but with only one unrolled lookup). And only tested on one specific machine. But let us leave it to a follow up RFE. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20067#discussion_r1716546168