On Thu, May 9, 2024 at 4:50 PM Jeff Law <jeffreya...@gmail.com> wrote:
>
> On 5/7/24 11:52 PM, Christoph Müllner wrote:
> > GCC has a generic cmpmemsi expansion via the by-pieces framework,
> > which shows some room for target-specific optimizations.
> > E.g. for comparing two aligned memory blocks of 15 bytes
> > we get the following sequence:
> >
> > my_mem_cmp_aligned_15:
> >         li      a4,0
> >         j       .L2
> > .L8:
> >         bgeu    a4,a7,.L7
> > .L2:
> >         add     a2,a0,a4
> >         add     a3,a1,a4
> >         lbu     a5,0(a2)
> >         lbu     a6,0(a3)
> >         addi    a4,a4,1
> >         li      a7,15    // missed hoisting
> >         subw    a5,a5,a6
> >         andi    a5,a5,0xff // useless
> >         beq     a5,zero,.L8
> >         lbu     a0,0(a2) // loading again!
> >         lbu     a5,0(a3) // loading again!
> >         subw    a0,a0,a5
> >         ret
> > .L7:
> >         li      a0,0
> >         ret
> >
> > Diff first byte: 15 insns
> > Diff second byte: 25 insns
> > No diff: 25 insns
> >
> > Possible improvements:
> > * unroll the loop and use load-with-displacement to avoid offset increments
> > * load and compare multiple (aligned) bytes at once
> > * use the bitmanip/strcmp result calculation (reverse words and
> >   synthesize (a2 >= a3) ? 1 : -1 in a branchless sequence)
> >
> > When applying these improvements we get the following sequence:
> >
> > my_mem_cmp_aligned_15:
> >         ld      a5,0(a0)
> >         ld      a4,0(a1)
> >         bne     a5,a4,.L2
> >         ld      a5,8(a0)
> >         ld      a4,8(a1)
> >         slli    a5,a5,8
> >         slli    a4,a4,8
> >         bne     a5,a4,.L2
> >         li      a0,0
> > .L3:
> >         sext.w  a0,a0
> >         ret
> > .L2:
> >         rev8    a5,a5
> >         rev8    a4,a4
> >         sltu    a5,a5,a4
> >         neg     a5,a5
> >         ori     a0,a5,1
> >         j       .L3
> >
> > Diff first byte: 11 insns
> > Diff second byte: 16 insns
> > No diff: 11 insns
> >
> > This patch implements these improvements.
> >
> > The tests consist of an execution test (similar to
> > gcc/testsuite/gcc.dg/torture/inline-mem-cmp-1.c) and a few tests
> > that check the expansion conditions (known length and alignment).
> >
> > Similar to the cpymemsi expansion, this patch does not introduce any
> > gating for the cmpmemsi expansion (on top of requiring the known length,
> > alignment and Zbb).
> >
> > Bootstrapped and SPEC CPU 2017 tested.
> >
> > gcc/ChangeLog:
> >
> >         * config/riscv/riscv-protos.h (riscv_expand_block_compare): New
> >         prototype.
> >         * config/riscv/riscv-string.cc (GEN_EMIT_HELPER2): New helper.
> >         (do_load_from_addr): Add support for HI and SI/64 modes.
> >         (emit_memcmp_scalar_load_and_compare): New helper to emit memcmp.
> >         (emit_memcmp_scalar_result_calculation): Likewise.
> >         (riscv_expand_block_compare_scalar): Likewise.
> >         (riscv_expand_block_compare): New RISC-V expander for memory compare.
> >         * config/riscv/riscv.md (cmpmemsi): New cmpmem expansion.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/riscv/cmpmemsi-1.c: New test.
> >         * gcc.target/riscv/cmpmemsi-2.c: New test.
> >         * gcc.target/riscv/cmpmemsi-3.c: New test.
> >         * gcc.target/riscv/cmpmemsi.c: New test.
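As a side note for readers of the thread: the branchless result
calculation quoted above (rev8/sltu/neg/ori) corresponds roughly to the
following C model. This is illustrative only and not part of the patch;
the function name is invented, and it covers the case where the two
loaded words already differ:

#include <stdint.h>

/* Byte-swapping (rev8) makes the first differing byte the most
   significant one, so a single unsigned compare yields the memcmp
   ordering; -(w1 < w2) | 1 then maps that to -1 or 1 without a
   branch.  */
static int
memcmp_result (uint64_t w1, uint64_t w2)
{
  w1 = __builtin_bswap64 (w1);       /* rev8 a5,a5 */
  w2 = __builtin_bswap64 (w2);       /* rev8 a4,a4 */
  int64_t lt = (int64_t) (w1 < w2);  /* sltu a5,a5,a4 */
  return (int) (-lt | 1);            /* neg a5,a5; ori a0,a5,1 */
}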
> >
> > Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu>
> > ---
> >  gcc/config/riscv/riscv-protos.h             |   1 +
> >  gcc/config/riscv/riscv-string.cc            | 161 ++++++++++++++++++++
> >  gcc/config/riscv/riscv.md                   |  15 ++
> >  gcc/testsuite/gcc.target/riscv/cmpmemsi-1.c |   6 +
> >  gcc/testsuite/gcc.target/riscv/cmpmemsi-2.c |  42 +++++
> >  gcc/testsuite/gcc.target/riscv/cmpmemsi-3.c |  43 ++++++
> >  gcc/testsuite/gcc.target/riscv/cmpmemsi.c   |  22 +++
> >  7 files changed, 290 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-3.c
> >  create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi.c
> >
> > diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
> > index e5aebf3fc3d..30ffe30be1d 100644
> > --- a/gcc/config/riscv/riscv-protos.h
> > +++ b/gcc/config/riscv/riscv-protos.h
> > @@ -188,6 +188,7 @@ rtl_opt_pass * make_pass_avlprop (gcc::context *ctxt);
> >  rtl_opt_pass * make_pass_vsetvl (gcc::context *ctxt);
> >
> >  /* Routines implemented in riscv-string.c.  */
> > +extern bool riscv_expand_block_compare (rtx, rtx, rtx, rtx);
> >  extern bool riscv_expand_block_move (rtx, rtx, rtx);
> >
> >  /* Information about one CPU we know about.  */
> > diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
> > index b09b51d7526..9d4dc0cb827 100644
> > --- a/gcc/config/riscv/riscv-string.cc
> > +++ b/gcc/config/riscv/riscv-string.cc
> > @@ -86,6 +86,7 @@ GEN_EMIT_HELPER2(th_rev) /* do_th_rev2 */
> >  GEN_EMIT_HELPER2(th_tstnbz) /* do_th_tstnbz2 */
> >  GEN_EMIT_HELPER3(xor) /* do_xor3 */
> >  GEN_EMIT_HELPER2(zero_extendqi) /* do_zero_extendqi2 */
> > +GEN_EMIT_HELPER2(zero_extendhi) /* do_zero_extendhi2 */
> >
> >  #undef GEN_EMIT_HELPER2
> >  #undef GEN_EMIT_HELPER3
> > @@ -109,6 +110,10 @@ do_load_from_addr (machine_mode mode, rtx dest, rtx addr_reg, rtx addr)
> >
> >    if (mode == QImode)
> >      do_zero_extendqi2 (dest, mem);
> > +  else if (mode == HImode)
> > +    do_zero_extendhi2 (dest, mem);
> > +  else if (mode == SImode && TARGET_64BIT)
> > +    emit_insn (gen_zero_extendsidi2 (dest, mem));
> >    else if (mode == Xmode)
> >      emit_move_insn (dest, mem);
> >    else
> > @@ -610,6 +615,162 @@ riscv_expand_strlen (rtx result, rtx src, rtx search_char, rtx align)
> >    return false;
> >  }
> >
> > +static void
> > +emit_memcmp_scalar_load_and_compare (rtx result, rtx src1, rtx src2,
> > +                                     unsigned HOST_WIDE_INT nbytes,
> > +                                     rtx data1, rtx data2,
> > +                                     rtx diff_label, rtx final_label)
> Needs a function comment.
Will do.

> > +{
> > +  const unsigned HOST_WIDE_INT xlen = GET_MODE_SIZE (Xmode);
> > +  rtx src1_addr = force_reg (Pmode, XEXP (src1, 0));
> > +  rtx src2_addr = force_reg (Pmode, XEXP (src2, 0));
> > +  unsigned HOST_WIDE_INT offset = 0;
> > +
> > +  while (nbytes > 0)
> > +    {
> > +      unsigned HOST_WIDE_INT cmp_bytes = xlen < nbytes ? xlen : nbytes;
> > +      machine_mode load_mode;
> > +
> > +      /* Special cases to avoid masking of trailing bytes.  */
> > +      if (cmp_bytes == 1)
> > +        load_mode = QImode;
> > +      else if (cmp_bytes == 2)
> > +        load_mode = HImode;
> > +      else if (cmp_bytes == 4)
> > +        load_mode = SImode;
> > +      else
> > +        load_mode = Xmode;
> > +
> > +      rtx addr1 = gen_rtx_PLUS (Pmode, src1_addr, GEN_INT (offset));
> > +      do_load_from_addr (load_mode, data1, addr1, src1);
> > +      rtx addr2 = gen_rtx_PLUS (Pmode, src2_addr, GEN_INT (offset));
> > +      do_load_from_addr (load_mode, data2, addr2, src2);
> So we're assuming that something simplifies x + 0 -> x for the first
> iteration of this loop.  I'm not sure that's going to be true for -O0.
> Can you double check that if this expansion code is used with -O0 that
> it gets cleaned up?  adjust_address or something else might be able to
> do this for you automagically.

-O0 does not trigger the expansion.
But I'll change to adjust_address(), where this is properly handled.
I'll also send a patch to adjust this for the strcmp expansion
(where I copied from).

> > +
> > +static void
> > +emit_memcmp_scalar_result_calculation (rtx result, rtx data1, rtx data2)
> Needs function comment.
>
> > +
> > +static bool
> > +riscv_expand_block_compare_scalar (rtx result, rtx src1, rtx src2, rtx nbytes)
> Function comment.

Will do.

> > +
> > +  /* We need xlen-aligned memory.  */
> > +  unsigned HOST_WIDE_INT align = MIN (MEM_ALIGN (src1), MEM_ALIGN (src2));
> > +  if (align < (xlen * BITS_PER_UNIT))
> > +    return false;
> Not needed for this submission, but this probably could be relaxed based
> on uarch tuning.
>
> You might also ponder if we can do the result calculation in WORD_MODE.
> One of the things in my queue to submit is adjustment to the inline
> strcmp code to do that.  When done carefully we can squash out a sign
> extension.  Again, not strictly necessary, but perhaps a followup if
> you're interested.
>
> So generally OK.  The only technical concern is generation of x+0 as an
> address.  Depending on what you find there take appropriate action.  If
> nothing is needed in that space, then just fixup the missing comments
> and this is fine for the trunk.
>
> Jeff
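For reference, a rough and untested sketch of the adjust_address()
variant discussed above, using the variable names from the quoted hunk
and assuming do_load_from_addr accepts the adjusted MEM as in the
existing calls:

      /* adjust_address folds a zero offset away and carries over the
         alias set and alignment of the original MEM operands, so no
         x+0 address is emitted on the first iteration.  */
      rtx mem1 = adjust_address (src1, load_mode, offset);
      rtx mem2 = adjust_address (src2, load_mode, offset);
      do_load_from_addr (load_mode, data1, XEXP (mem1, 0), mem1);
      do_load_from_addr (load_mode, data2, XEXP (mem2, 0), mem2);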