On Thu, May 9, 2024 at 4:50 PM Jeff Law <jeffreya...@gmail.com> wrote:
>
>
>
> On 5/7/24 11:52 PM, Christoph Müllner wrote:
> > GCC has a generic cmpmemsi expansion via the by-pieces framework,
> > which shows some room for target-specific optimizations.
> > E.g. for comparing two aligned memory blocks of 15 bytes
> > we get the following sequence:
> >
> > my_mem_cmp_aligned_15:
> >          li      a4,0
> >          j       .L2
> > .L8:
> >          bgeu    a4,a7,.L7
> > .L2:
> >          add     a2,a0,a4
> >          add     a3,a1,a4
> >          lbu     a5,0(a2)
> >          lbu     a6,0(a3)
> >          addi    a4,a4,1
> >          li      a7,15    // missed hoisting
> >          subw    a5,a5,a6
> >          andi    a5,a5,0xff // useless
> >          beq     a5,zero,.L8
> >          lbu     a0,0(a2) // loading again!
> >          lbu     a5,0(a3) // loading again!
> >          subw    a0,a0,a5
> >          ret
> > .L7:
> >          li      a0,0
> >          ret
> >
> > Diff first byte: 15 insns
> > Diff second byte: 25 insns
> > No diff: 25 insns
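For context, the baseline sequence above is what you get for a memcmp call with a compile-time-known length on suitably aligned buffers, along the lines of this sketch (the function name is taken from the listing; the exact declarations in the testsuite may differ):

```c
#include <string.h>

/* Sketch of the kind of source that produces the sequences in this
   thread: a memcmp with a known length of 15 bytes.  The expander
   additionally requires xlen-aligned operands (via MEM_ALIGN).  */
int
my_mem_cmp_aligned_15 (const void *a, const void *b)
{
  return memcmp (a, b, 15);
}
```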
> >
> > Possible improvements:
> > * unroll the loop and use load-with-displacement to avoid offset increments
> > * load and compare multiple (aligned) bytes at once
> > * Use the bitmanip/strcmp result calculation (reverse words and
> >    synthesize (a2 >= a3) ? 1 : -1 in a branchless sequence)
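The result calculation in the last bullet can be sketched in C (names are illustrative, not from the patch): byte-reversing both words makes an unsigned compare give lexicographic byte order, and `-lt | 1` maps the sltu result to -1 or 1 without a branch:

```c
#include <stdint.h>

/* Branchless memcmp result from two differing XLEN words (RV64 case).
   rev8 both words so an unsigned compare follows memory byte order,
   then map (w1 < w2) to -1 and (w1 > w2) to 1 via neg/ori.  */
static int
memcmp_result_from_words (uint64_t w1, uint64_t w2)
{
  uint64_t r1 = __builtin_bswap64 (w1);   /* rev8 a5,a5 */
  uint64_t r2 = __builtin_bswap64 (w2);   /* rev8 a4,a4 */
  uint64_t lt = r1 < r2;                  /* sltu a5,a5,a4 */
  return (int) (-(int64_t) lt | 1);       /* neg a5,a5; ori a0,a5,1 */
}
```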
> >
> > When applying these improvements we get the following sequence:
> >
> > my_mem_cmp_aligned_15:
> >          ld      a5,0(a0)
> >          ld      a4,0(a1)
> >          bne     a5,a4,.L2
> >          ld      a5,8(a0)
> >          ld      a4,8(a1)
> >          slli    a5,a5,8
> >          slli    a4,a4,8
> >          bne     a5,a4,.L2
> >          li      a0,0
> > .L3:
> >          sext.w  a0,a0
> >          ret
> > .L2:
> >          rev8    a5,a5
> >          rev8    a4,a4
> >          sltu    a5,a5,a4
> >          neg     a5,a5
> >          ori     a0,a5,1
> >          j       .L3
> >
> > Diff first byte: 11 insns
> > Diff second byte: 16 insns
> > No diff: 11 insns
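The slli by 8 before the second compare is what lets a full 8-byte load cover the 7-byte tail: on a little-endian target the out-of-range byte at offset 15 lands in bits 56..63 of the doubleword loaded at offset 8, and shifting left by 8 discards it. A little-endian C sketch (helper name is illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Compare bytes 8..14 of two buffers by loading a full doubleword at
   offset 8 and shifting out the out-of-range top byte, mirroring the
   ld/slli/bne pair in the improved sequence.  Assumes little endian.  */
static int
tail7_equal (const unsigned char *a, const unsigned char *b)
{
  uint64_t x, y;
  memcpy (&x, a + 8, 8);        /* ld a5,8(a0) */
  memcpy (&y, b + 8, 8);        /* ld a4,8(a1) */
  return (x << 8) == (y << 8);  /* slli by 8 drops byte 15 */
}
```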
> >
> > This patch implements these improvements.
> >
> > The tests consist of an execution test (similar to
> > gcc/testsuite/gcc.dg/torture/inline-mem-cmp-1.c) and a few tests
> > that test the expansion conditions (known length and alignment).
> >
> > Similar to the cpymemsi expansion, this patch does not introduce any
> > gating for the cmpmemsi expansion (beyond requiring known length,
> > alignment, and Zbb).
> >
> > Bootstrapped and SPEC CPU 2017 tested.
> >
> > gcc/ChangeLog:
> >
> >       * config/riscv/riscv-protos.h (riscv_expand_block_compare): New
> >       prototype.
> >       * config/riscv/riscv-string.cc (GEN_EMIT_HELPER2): New helper.
> >       (do_load_from_addr): Add support for HI and SI/64 modes.
> >       (emit_memcmp_scalar_load_and_compare): New helper to emit memcmp.
> >       (emit_memcmp_scalar_result_calculation): Likewise.
> >       (riscv_expand_block_compare_scalar): Likewise.
> >       (riscv_expand_block_compare): New RISC-V expander for memory compare.
> >       * config/riscv/riscv.md (cmpmemsi): New cmpmem expansion.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.target/riscv/cmpmemsi-1.c: New test.
> >       * gcc.target/riscv/cmpmemsi-2.c: New test.
> >       * gcc.target/riscv/cmpmemsi-3.c: New test.
> >       * gcc.target/riscv/cmpmemsi.c: New test.
> >
> > Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu>
> > ---
> >   gcc/config/riscv/riscv-protos.h             |   1 +
> >   gcc/config/riscv/riscv-string.cc            | 161 ++++++++++++++++++++
> >   gcc/config/riscv/riscv.md                   |  15 ++
> >   gcc/testsuite/gcc.target/riscv/cmpmemsi-1.c |   6 +
> >   gcc/testsuite/gcc.target/riscv/cmpmemsi-2.c |  42 +++++
> >   gcc/testsuite/gcc.target/riscv/cmpmemsi-3.c |  43 ++++++
> >   gcc/testsuite/gcc.target/riscv/cmpmemsi.c   |  22 +++
> >   7 files changed, 290 insertions(+)
> >   create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-1.c
> >   create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-2.c
> >   create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-3.c
> >   create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi.c
> >
> > diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
> > index e5aebf3fc3d..30ffe30be1d 100644
> > --- a/gcc/config/riscv/riscv-protos.h
> > +++ b/gcc/config/riscv/riscv-protos.h
> > @@ -188,6 +188,7 @@ rtl_opt_pass * make_pass_avlprop (gcc::context *ctxt);
> >   rtl_opt_pass * make_pass_vsetvl (gcc::context *ctxt);
> >
> >   /* Routines implemented in riscv-string.c.  */
> > +extern bool riscv_expand_block_compare (rtx, rtx, rtx, rtx);
> >   extern bool riscv_expand_block_move (rtx, rtx, rtx);
> >
> >   /* Information about one CPU we know about.  */
> > diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
> > index b09b51d7526..9d4dc0cb827 100644
> > --- a/gcc/config/riscv/riscv-string.cc
> > +++ b/gcc/config/riscv/riscv-string.cc
> > @@ -86,6 +86,7 @@ GEN_EMIT_HELPER2(th_rev) /* do_th_rev2  */
> >   GEN_EMIT_HELPER2(th_tstnbz) /* do_th_tstnbz2  */
> >   GEN_EMIT_HELPER3(xor) /* do_xor3  */
> >   GEN_EMIT_HELPER2(zero_extendqi) /* do_zero_extendqi2  */
> > +GEN_EMIT_HELPER2(zero_extendhi) /* do_zero_extendhi2  */
> >
> >   #undef GEN_EMIT_HELPER2
> >   #undef GEN_EMIT_HELPER3
> > @@ -109,6 +110,10 @@ do_load_from_addr (machine_mode mode, rtx dest, rtx addr_reg, rtx addr)
> >
> >     if (mode == QImode)
> >       do_zero_extendqi2 (dest, mem);
> > +  else if (mode == HImode)
> > +    do_zero_extendhi2 (dest, mem);
> > +  else if (mode == SImode && TARGET_64BIT)
> > +    emit_insn (gen_zero_extendsidi2 (dest, mem));
> >     else if (mode == Xmode)
> >       emit_move_insn (dest, mem);
> >     else
> > @@ -610,6 +615,162 @@ riscv_expand_strlen (rtx result, rtx src, rtx search_char, rtx align)
> >     return false;
> >   }
> >
> > +static void
> > +emit_memcmp_scalar_load_and_compare (rtx result, rtx src1, rtx src2,
> > +                                  unsigned HOST_WIDE_INT nbytes,
> > +                                  rtx data1, rtx data2,
> > +                                  rtx diff_label, rtx final_label)
> Needs a function comment.

Will do.

>
>
> > +{
> > +  const unsigned HOST_WIDE_INT xlen = GET_MODE_SIZE (Xmode);
> > +  rtx src1_addr = force_reg (Pmode, XEXP (src1, 0));
> > +  rtx src2_addr = force_reg (Pmode, XEXP (src2, 0));
> > +  unsigned HOST_WIDE_INT offset = 0;
> > +
> > +  while (nbytes > 0)
> > +    {
> > +      unsigned HOST_WIDE_INT cmp_bytes = xlen < nbytes ? xlen : nbytes;
> > +      machine_mode load_mode;
> > +
> > +      /* Special cases to avoid masking of trailing bytes.  */
> > +      if (cmp_bytes == 1)
> > +     load_mode = QImode;
> > +      else if (cmp_bytes == 2)
> > +     load_mode = HImode;
> > +      else if (cmp_bytes == 4)
> > +     load_mode = SImode;
> > +      else
> > +     load_mode = Xmode;
> > +
> > +      rtx addr1 = gen_rtx_PLUS (Pmode, src1_addr, GEN_INT (offset));
> > +      do_load_from_addr (load_mode, data1, addr1, src1);
> > +      rtx addr2 = gen_rtx_PLUS (Pmode, src2_addr, GEN_INT (offset));
> > +      do_load_from_addr (load_mode, data2, addr2, src2);
> So we're assuming that something simplifies x + 0 -> x for the first
> iteration of this loop.  I'm not sure that's going to be true for -O0.
> Can you double check that if this expansion code is used with -O0 that
> it gets cleaned up?  adjust_address or something else might be able to
> do this for you automagically.

-O0 does not trigger the expansion.
But I'll change to adjust_address(), where this is properly handled.
I'll also send a patch to adjust this for the strcmp expansion (where
I copied from).
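For reference, the adjust_address variant would look roughly like this (a sketch against GCC's internal RTL API, not the committed patch):

```c
/* adjust_address re-bases the original MEM rtx to the given mode and
   offset, preserving alias and alignment information and folding away
   a zero offset, so no explicit (plus addr (const_int 0)) is built:  */
rtx mem1 = adjust_address (src1, load_mode, offset);
rtx mem2 = adjust_address (src2, load_mode, offset);
```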


>
> > +
> > +static void
> > +emit_memcmp_scalar_result_calculation (rtx result, rtx data1, rtx data2)
> Needs function comment.
>
>
>
> > +
> > +static bool
> > +riscv_expand_block_compare_scalar (rtx result, rtx src1, rtx src2, rtx nbytes)
> Function comment.

Will do.

>
>
> > +
> > +  /* We need xlen-aligned memory.  */
> > +  unsigned HOST_WIDE_INT align = MIN (MEM_ALIGN (src1), MEM_ALIGN (src2));
> > +  if (align < (xlen * BITS_PER_UNIT))
> > +    return false;
> Not needed for this submission, but this probably could be relaxed based
> on uarch tuning.



>
> You might also ponder if we can do the result calculation in WORD_MODE.
> One of the things in my queue to submit is adjustment to the inline
> strcmp code to do that.  When done carefully we can squash out a sign
> extension.  Again, not strictly necessary, but perhaps a followup if
> you're interested.
>
> So generally OK.  The only technical concern is generation of x+0 as an
> address.  Depending on what you find there take appropriate action.  If
> nothing is needed in that space, then just fixup the missing comments
> and this is fine for the trunk.
>
> Jeff
