On 5/7/24 11:52 PM, Christoph Müllner wrote:
GCC has a generic cmpmemsi expansion via the by-pieces framework,
which shows some room for target-specific optimizations.
E.g. for comparing two aligned memory blocks of 15 bytes
we get the following sequence:

my_mem_cmp_aligned_15:
         li      a4,0
         j       .L2
.L8:
         bgeu    a4,a7,.L7
.L2:
         add     a2,a0,a4
         add     a3,a1,a4
         lbu     a5,0(a2)
         lbu     a6,0(a3)
         addi    a4,a4,1
         li      a7,15    // missed hoisting
         subw    a5,a5,a6
         andi    a5,a5,0xff // useless
         beq     a5,zero,.L8
         lbu     a0,0(a2) // loading again!
         lbu     a5,0(a3) // loading again!
         subw    a0,a0,a5
         ret
.L7:
         li      a0,0
         ret

Diff first byte: 15 insns
Diff second byte: 25 insns
No diff: 25 insns
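
For reference, the sequences here correspond to a C input along these
lines (reconstructed from the label name, so the exact source may
differ):

int
my_mem_cmp_aligned_15 (const void *s1, const void *s2)
{
  s1 = __builtin_assume_aligned (s1, 8);
  s2 = __builtin_assume_aligned (s2, 8);
  return __builtin_memcmp (s1, s2, 15);
}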

Possible improvements:
* unroll the loop and use load-with-displacement to avoid offset increments
* load and compare multiple (aligned) bytes at once
* Use the bitmanip/strcmp result calculation (reverse words and
   synthesize (a2 >= a3) ? 1 : -1 in a branchless sequence; a C sketch
   of this follows below)
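
In C the branchless result calculation sketches out like this
(data1/data2 being the differing xlen-sized chunks; sketch only, the
patch emits the equivalent RTL directly):

  uint64_t r1 = __builtin_bswap64 (data1);   /* rev8 */
  uint64_t r2 = __builtin_bswap64 (data2);   /* rev8 */
  long lt = -(long) (r1 < r2);               /* sltu + neg: -1 if less, else 0 */
  int res = (int) (lt | 1);                  /* ori: maps to -1 or +1 */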

When applying these improvements we get the following sequence:

my_mem_cmp_aligned_15:
         ld      a5,0(a0)
         ld      a4,0(a1)
         bne     a5,a4,.L2
         ld      a5,8(a0)
         ld      a4,8(a1)
         slli    a5,a5,8
         slli    a4,a4,8
         bne     a5,a4,.L2
         li      a0,0
.L3:
         sext.w  a0,a0
         ret
.L2:
         rev8    a5,a5
         rev8    a4,a4
         sltu    a5,a5,a4
         neg     a5,a5
         ori     a0,a5,1
         j       .L3

Diff first byte: 11 insns
Diff second byte: 16 insns
No diff: 11 insns

This patch implements these improvements.

The tests consist of an execution test (similar to
gcc/testsuite/gcc.dg/torture/inline-mem-cmp-1.c) and a few tests
that exercise the expansion conditions (known length and alignment).
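
For illustration, the expansion-condition tests have roughly the
following shape (hypothetical fragment, not the actual file contents):

/* { dg-do compile } */
/* { dg-options "-march=rv64gc_zbb -mabi=lp64d -O2" } */

typedef struct { long a, b; } blk_t;  /* xlen-aligned on rv64 */

int
cmp15 (const blk_t *x, const blk_t *y)
{
  /* Known length and sufficient alignment -> expected inline expansion.  */
  return __builtin_memcmp (x, y, 15);
}

/* { dg-final { scan-assembler-not {\mcall\tmemcmp} } } */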

Similar to the cpymemsi expansion, this patch does not introduce any
gating for the cmpmemsi expansion (beyond requiring known length,
alignment and Zbb).

Bootstrapped and SPEC CPU 2017 tested.

gcc/ChangeLog:

        * config/riscv/riscv-protos.h (riscv_expand_block_compare): New
        prototype.
        * config/riscv/riscv-string.cc (GEN_EMIT_HELPER2): New helper.
        (do_load_from_addr): Add support for HI and SI/64 modes.
        (emit_memcmp_scalar_load_and_compare): New helper to emit memcmp.
        (emit_memcmp_scalar_result_calculation): Likewise.
        (riscv_expand_block_compare_scalar): Likewise.
        (riscv_expand_block_compare): New RISC-V expander for memory compare.
        * config/riscv/riscv.md (cmpmemsi): New cmpmem expansion.

gcc/testsuite/ChangeLog:

        * gcc.target/riscv/cmpmemsi-1.c: New test.
        * gcc.target/riscv/cmpmemsi-2.c: New test.
        * gcc.target/riscv/cmpmemsi-3.c: New test.
        * gcc.target/riscv/cmpmemsi.c: New test.

Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu>
---
  gcc/config/riscv/riscv-protos.h             |   1 +
  gcc/config/riscv/riscv-string.cc            | 161 ++++++++++++++++++++
  gcc/config/riscv/riscv.md                   |  15 ++
  gcc/testsuite/gcc.target/riscv/cmpmemsi-1.c |   6 +
  gcc/testsuite/gcc.target/riscv/cmpmemsi-2.c |  42 +++++
  gcc/testsuite/gcc.target/riscv/cmpmemsi-3.c |  43 ++++++
  gcc/testsuite/gcc.target/riscv/cmpmemsi.c   |  22 +++
  7 files changed, 290 insertions(+)
  create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-1.c
  create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-2.c
  create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-3.c
  create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi.c

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index e5aebf3fc3d..30ffe30be1d 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -188,6 +188,7 @@ rtl_opt_pass * make_pass_avlprop (gcc::context *ctxt);
  rtl_opt_pass * make_pass_vsetvl (gcc::context *ctxt);
/* Routines implemented in riscv-string.c. */
+extern bool riscv_expand_block_compare (rtx, rtx, rtx, rtx);
  extern bool riscv_expand_block_move (rtx, rtx, rtx);
/* Information about one CPU we know about. */
diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
index b09b51d7526..9d4dc0cb827 100644
--- a/gcc/config/riscv/riscv-string.cc
+++ b/gcc/config/riscv/riscv-string.cc
@@ -86,6 +86,7 @@ GEN_EMIT_HELPER2(th_rev) /* do_th_rev2  */
  GEN_EMIT_HELPER2(th_tstnbz) /* do_th_tstnbz2  */
  GEN_EMIT_HELPER3(xor) /* do_xor3  */
  GEN_EMIT_HELPER2(zero_extendqi) /* do_zero_extendqi2  */
+GEN_EMIT_HELPER2(zero_extendhi) /* do_zero_extendhi2  */
#undef GEN_EMIT_HELPER2
  #undef GEN_EMIT_HELPER3
@@ -109,6 +110,10 @@ do_load_from_addr (machine_mode mode, rtx dest, rtx addr_reg, rtx addr)
if (mode == QImode)
      do_zero_extendqi2 (dest, mem);
+  else if (mode == HImode)
+    do_zero_extendhi2 (dest, mem);
+  else if (mode == SImode && TARGET_64BIT)
+    emit_insn (gen_zero_extendsidi2 (dest, mem));
    else if (mode == Xmode)
      emit_move_insn (dest, mem);
    else
@@ -610,6 +615,162 @@ riscv_expand_strlen (rtx result, rtx src, rtx search_char, rtx align)
    return false;
  }
+static void
+emit_memcmp_scalar_load_and_compare (rtx result, rtx src1, rtx src2,
+                                    unsigned HOST_WIDE_INT nbytes,
+                                    rtx data1, rtx data2,
+                                    rtx diff_label, rtx final_label)
Needs a function comment.


+{
+  const unsigned HOST_WIDE_INT xlen = GET_MODE_SIZE (Xmode);
+  rtx src1_addr = force_reg (Pmode, XEXP (src1, 0));
+  rtx src2_addr = force_reg (Pmode, XEXP (src2, 0));
+  unsigned HOST_WIDE_INT offset = 0;
+
+  while (nbytes > 0)
+    {
+      unsigned HOST_WIDE_INT cmp_bytes = xlen < nbytes ? xlen : nbytes;
+      machine_mode load_mode;
+
+      /* Special cases to avoid masking of trailing bytes.  */
+      if (cmp_bytes == 1)
+       load_mode = QImode;
+      else if (cmp_bytes == 2)
+       load_mode = HImode;
+      else if (cmp_bytes == 4)
+       load_mode = SImode;
+      else
+       load_mode = Xmode;
+
+      rtx addr1 = gen_rtx_PLUS (Pmode, src1_addr, GEN_INT (offset));
+      do_load_from_addr (load_mode, data1, addr1, src1);
+      rtx addr2 = gen_rtx_PLUS (Pmode, src2_addr, GEN_INT (offset));
+      do_load_from_addr (load_mode, data2, addr2, src2);
So we're assuming that something simplifies x + 0 -> x for the first
iteration of this loop. I'm not sure that's going to be true at -O0.
Can you double-check that this gets cleaned up when the expansion code
is used at -O0? adjust_address or something else might be able to do
this for you automagically.
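
Something like this untested sketch is what I have in mind:

  /* adjust_address folds a constant zero offset into the address and
     keeps the alias/alignment info of the original MEM.  */
  rtx mem1 = adjust_address (src1, load_mode, offset);
  emit_move_insn (data1, mem1);  /* or a zero-extending load for sub-word modes */

But anything that keeps the x+0 from surviving expansion is fine with me.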

+
+static void
+emit_memcmp_scalar_result_calculation (rtx result, rtx data1, rtx data2)
Needs function comment.



+
+static bool
+riscv_expand_block_compare_scalar (rtx result, rtx src1, rtx src2, rtx nbytes)
Function comment.


+
+  /* We need xlen-aligned memory.  */
+  unsigned HOST_WIDE_INT align = MIN (MEM_ALIGN (src1), MEM_ALIGN (src2));
+  if (align < (xlen * BITS_PER_UNIT))
+    return false;
Not needed for this submission, but this probably could be relaxed based on uarch tuning.

You might also ponder if we can do the result calculation in WORD_MODE. One of the things in my queue to submit is an adjustment to the inline strcmp code to do just that. When done carefully we can squash out a sign extension. Again, not strictly necessary, but perhaps a followup if you're interested.
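
For concreteness, something along these lines (hand-wavy and untested):

  rtx res_x = gen_reg_rtx (word_mode);
  /* ... emit the rev8/sltu/neg/ori sequence into res_x ... */
  emit_move_insn (result, gen_lowpart (SImode, res_x));

Since -1/0/1 are already sign-extended in an XLEN-wide register, the
sext.w in the improved sequence above should then become unnecessary.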

So generally OK. The only technical concern is the generation of x+0
as an address. Depending on what you find there, take appropriate
action. If nothing is needed in that space, then just fix up the
missing comments and this is fine for the trunk.

Jeff
