Mariam Arutunian <mariamarutun...@gmail.com> writes:
> On Sat, Jun 8, 2024 at 3:41 PM Richard Sandiford <richard.sandif...@arm.com>
> wrote:
>
>> Mariam Arutunian <mariamarutun...@gmail.com> writes:
>> > This patch introduces two new expanders for the aarch64 backend,
>> > dedicated to generating optimized code for CRC computations.
>> > The new expanders are designed to leverage specific hardware capabilities
>> > to achieve faster CRC calculations,
>> > particularly using the pmul or crc32 instructions when supported by the
>> > target architecture.
>>
>> Thanks for porting this to aarch64!
>>
>> > Expander 1: Bit-Forward CRC (crc<ALLI:mode><ALLX:mode>4)
>> > For targets that support the pmul instruction (TARGET_AES),
>> > the expander will generate code that uses the pmul (crypto_pmulldi)
>> > instruction for CRC computation.
>> >
>> > Expander 2: Bit-Reversed CRC (crc_rev<ALLI:mode><ALLX:mode>4)
>> > The expander first checks if the target supports the CRC32 instruction set
>> > (TARGET_CRC32)
>> > and the polynomial in use is 0x1EDC6F41 (iSCSI).  If the conditions are met,
>> > it emits calls to the corresponding crc32 instruction (crc32b, crc32h,
>> > crc32w, or crc32x, depending on the data size).
>> > If the target does not support crc32 but supports pmul, it then uses the
>> > pmul (crypto_pmulldi) instruction for bit-reversed CRC computation.
>> >
>> > Otherwise, table-based CRC is generated.
>> >
>> > gcc/config/aarch64/
>> >
>> > 	* aarch64-protos.h (aarch64_expand_crc_using_clmul): New extern
>> > 	function declaration.
>> > 	(aarch64_expand_reversed_crc_using_clmul): Likewise.
>> > 	* aarch64.cc (aarch64_expand_crc_using_clmul): New function.
>> > 	(aarch64_expand_reversed_crc_using_clmul): Likewise.
>> > 	* aarch64.md (UNSPEC_CRC, UNSPEC_CRC_REV): New unspecs.
>> > 	(crc_rev<ALLI:mode><ALLX:mode>4): New expander for bit-reversed CRC.
>> > 	(crc<ALLI:mode><ALLX:mode>4): New expander for bit-forward CRC.
>> > 	* iterators.md (crc_data_type): New mode attribute.
>> >
>> > gcc/testsuite/gcc.target/aarch64/
>> >
>> > 	* crc-1-pmul.c: New test.
>> > 	* crc-10-pmul.c: Likewise.
>> > 	* crc-12-pmul.c: Likewise.
>> > 	* crc-13-pmul.c: Likewise.
>> > 	* crc-14-pmul.c: Likewise.
>> > 	* crc-17-pmul.c: Likewise.
>> > 	* crc-18-pmul.c: Likewise.
>> > 	* crc-21-pmul.c: Likewise.
>> > 	* crc-22-pmul.c: Likewise.
>> > 	* crc-23-pmul.c: Likewise.
>> > 	* crc-4-pmul.c: Likewise.
>> > 	* crc-5-pmul.c: Likewise.
>> > 	* crc-6-pmul.c: Likewise.
>> > 	* crc-7-pmul.c: Likewise.
>> > 	* crc-8-pmul.c: Likewise.
>> > 	* crc-9-pmul.c: Likewise.
>> > 	* crc-CCIT-data16-pmul.c: Likewise.
>> > 	* crc-CCIT-data8-pmul.c: Likewise.
>> > 	* crc-coremark-16bitdata-pmul.c: Likewise.
>> > 	* crc-crc32-data16.c: Likewise.
>> > 	* crc-crc32-data32.c: Likewise.
>> > 	* crc-crc32-data8.c: Likewise.
>> >
>> > Signed-off-by: Mariam Arutunian <mariamarutun...@gmail.com>
>>
>> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
>> > index 1d3f94c813e..167e1140f0d 100644
>> > --- a/gcc/config/aarch64/aarch64-protos.h
>> > +++ b/gcc/config/aarch64/aarch64-protos.h
>> > @@ -1117,5 +1117,8 @@ extern void mingw_pe_encode_section_info (tree, rtx, int);
>> >
>> >  bool aarch64_optimize_mode_switching (aarch64_mode_entity);
>> >  void aarch64_restore_za (rtx);
>> > +void aarch64_expand_crc_using_clmul (rtx *);
>> > +void aarch64_expand_reversed_crc_using_clmul (rtx *);
>> > +
>> >
>> >  #endif /* GCC_AARCH64_PROTOS_H */
>> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> > index ee12d8897a8..05cd0296d38 100644
>> > --- a/gcc/config/aarch64/aarch64.cc
>> > +++ b/gcc/config/aarch64/aarch64.cc
>> > @@ -30265,6 +30265,135 @@ aarch64_retrieve_sysreg (const char *regname, bool write_p, bool is128op)
>> >    return sysreg->encoding;
>> >  }
>> >
>> > +/* Generate assembly to calculate CRC
>> > +   using carry-less multiplication instruction.
>> > +   OPERANDS[1] is input CRC,
>> > +   OPERANDS[2] is data (message),
>> > +   OPERANDS[3] is the polynomial without the leading 1.  */
>> > +
>> > +void
>> > +aarch64_expand_crc_using_clmul (rtx *operands)
>>
>> This should probably be pmul rather than clmul.
>>
>> > +{
>> > +  /* Check and keep arguments.  */
>> > +  gcc_assert (!CONST_INT_P (operands[0]));
>> > +  gcc_assert (CONST_INT_P (operands[3]));
>> > +  rtx crc = operands[1];
>> > +  rtx data = operands[2];
>> > +  rtx polynomial = operands[3];
>> > +
>> > +  unsigned HOST_WIDE_INT
>> > +    crc_size = GET_MODE_BITSIZE (GET_MODE (operands[0])).to_constant ();
>> > +  gcc_assert (crc_size <= 32);
>> > +  unsigned HOST_WIDE_INT
>> > +    data_size = GET_MODE_BITSIZE (GET_MODE (data)).to_constant ();
>>
>> We could instead make the interface:
>>
>>   void
>>   aarch64_expand_crc_using_pmul (scalar_mode crc_mode, scalar_mode data_mode,
>>                                  rtx *operands)
>>
>> so that the lines above don't need the to_constant.  This should "just
>> work" on the .md file side, since the modes being passed are naturally
>> scalar_mode.
>>
>> I think it'd be worth asserting also that data_size <= crc_size.
>> (Although we could handle any MAX (data_size, crc_size) <= 32
>> with some adjustment.)
>>
>> > +
>> > +  /* Calculate the quotient.  */
>> > +  unsigned HOST_WIDE_INT
>> > +    q = gf2n_poly_long_div_quotient (UINTVAL (polynomial), crc_size + 1);
>> > +
>> > +  /* CRC calculation's main part.
>> > +     */
>> > +  if (crc_size > data_size)
>> > +    crc = expand_shift (RSHIFT_EXPR, DImode, crc, crc_size - data_size,
>> > +			NULL_RTX, 1);
>> > +
>> > +  rtx t0 = gen_reg_rtx (DImode);
>> > +  aarch64_emit_move (t0, gen_int_mode (q, DImode));
>>
>> It's only a minor simplification, but this could instead be:
>>
>>   rtx t0 = force_reg (DImode, gen_int_mode (q, DImode));
>>
>> > +  rtx t1 = gen_reg_rtx (DImode);
>> > +  aarch64_emit_move (t1, polynomial);
>>
>> If polynomial is a constant operand of mode crc_mode, GCC's standard
>> CONST_INT representation is to sign-extend the constant to 64 bits.
>> E.g. a QImode value of 0b1000_0000 would be represented as -128.
>>
>> I think here we want the zero-extended form, so it might be safer to do:
>>
>>   polynomial = simplify_gen_unary (ZERO_EXTEND, DImode, polynomial,
>>                                    crc_mode);
>>   rtx t1 = force_reg (DImode, polynomial);
>>
>> > +
>> > +  rtx a0 = expand_binop (DImode, xor_optab, crc, data, NULL_RTX, 1,
>> > +			 OPTAB_WIDEN);
>> > +
>> > +  rtx clmul_res = gen_reg_rtx (TImode);
>> > +  emit_insn (gen_aarch64_crypto_pmulldi (clmul_res, a0, t0));
>> > +  a0 = gen_lowpart (DImode, clmul_res);
>> > +
>> > +  a0 = expand_shift (RSHIFT_EXPR, DImode, a0, crc_size, NULL_RTX, 1);
>> > +
>> > +  emit_insn (gen_aarch64_crypto_pmulldi (clmul_res, a0, t1));
>> > +  a0 = gen_lowpart (DImode, clmul_res);
>> > +
>> > +  if (crc_size > data_size)
>> > +    {
>> > +      rtx crc_part = expand_shift (LSHIFT_EXPR, DImode, operands[1], data_size,
>> > +				   NULL_RTX, 0);
>> > +      a0 =  expand_binop (DImode, xor_optab, a0, crc_part, NULL_RTX, 1,
>> > +			    OPTAB_DIRECT);
>>
>> Formatting nit: extra space after "a0 = "
>>
>> > +    }
>> > +  /* Zero upper bits beyond crc_size.
>> > +     */
>> > +  rtx num_shift = gen_int_mode (64 - crc_size, DImode);
>> > +  a0 = expand_shift (LSHIFT_EXPR, DImode, a0, 64 - crc_size, NULL_RTX, 0);
>> > +  a0 = expand_shift (RSHIFT_EXPR, DImode, a0, 64 - crc_size, NULL_RTX, 1);
>>
>> Rather than shift left and then right, I think we should just AND:
>>
>>   rtx mask = gen_int_mode (GET_MODE_MASK (crc_mode), DImode);
>>   a0 = expand_binop (DImode, and_optab, a0, mask, NULL_RTX, 1,
>>                      OPTAB_DIRECT);
>>
>> That said, it looks like operands[0] has crc_mode.  The register bits
>> above crc_size therefore shouldn't matter, since they're undefined on
>> read.  E.g. even though (reg:SI R) is stored in an X register, only the
>> low 32 bits are defined; the upper 32 bits can be any value.
>>
>> So I'd expect we could replace this and...
>>
>> > +
>> > +  rtx tgt = simplify_gen_subreg (DImode, operands[0],
>> > +				 GET_MODE (operands[0]), 0);
>> > +  aarch64_emit_move (tgt, a0);
>>
>> ...this with just:
>>
>>   aarch64_emit_move (operands[0], gen_lowpart (crc_mode, a0));
>>
>> Perhaps that would break down if operands[0] is a subreg with
>> SUBREG_PROMOTED_VAR_P set, but I think it's up to target-independent
>> code to handle that case.
>
> Thank you once again for your review.
> I made all the recommended changes except for this and sent the new
> version.
> On some tests the CRC function was returning the whole computed value,
> without zeroing the upper part.
Thanks for the update, and sorry for the (very) slow review.

I tried this locally, and it looks like the need to zero is coming from
the SUBREG_PROMOTED_VAR_P thing after all.  In expand_crc_optab_fn:

  rtx dest = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);

is returning a subreg like:

  (subreg/s/v:SI (reg:DI R) 0)

where /s == SUBREG_PROMOTED_VAR_P and /v == SUBREG_PROMOTED_UNSIGNED_P.
This means that R is known to be a zero extension of a 32-bit value on
read and must be set to such a value on write.

In the patch, operands[0] is the subreg above, and so things break if we
don't set R to a zero-extended value.  But optabs aren't generally
required to handle this kind of subreg as a destination.  It's usually
up to target-independent code to deal with them instead.

expand_fn_using_insn handles this kind of situation by creating a
temporary SI destination, then copying a zero-extension of the temporary
back to R.  I've pushed 5686d3b8ae16d9aeea8d39a56ec6f8ecee661e01 to
split this code out into helper routines so that other functions in
internal-fn.cc can use them.  Applying:

@@ -3984,15 +3997,14 @@ expand_crc_optab_fn (internal_fn fn, gcall *stmt, convert_optab optab)
                                     (fn, tree_pair (data_type, result_type),
                                      OPTIMIZE_FOR_SPEED))
     {
       class expand_operand ops[4];
-      create_output_operand (&ops[0], dest, TYPE_MODE (result_type));
+      create_call_lhs_operand (&ops[0], dest, TYPE_MODE (result_type));
       create_input_operand (&ops[1], crc, TYPE_MODE (result_type));
       create_input_operand (&ops[2], data, TYPE_MODE (data_type));
       create_input_operand (&ops[3], polynomial, TYPE_MODE (result_type));
       insn_code icode = convert_optab_handler (optab, TYPE_MODE (data_type),
                                                TYPE_MODE (result_type));
       expand_insn (icode, 4, ops);
-      if (!rtx_equal_p (dest, ops[0].value))
-        emit_move_insn (dest, ops[0].value);
+      assign_call_lhs (lhs, dest, &ops[0]);
     }
   else
     {

on top of that seems to avoid the need for the zero extension.

> I'm also unsure about the indentation.
> Last time, I thought I had set it
> correctly, but it turned out to be incorrect.  This time, I've adjusted
> the settings, but I'm still not sure if it's correct.

Thanks.  The indentation mostly looks good.  I think the main remaining
thing is that arguments tend to be indented too far.  E.g.:

aarch64_expand_crc_using_pmul (scalar_mode crc_mode, scalar_mode data_mode,
				rtx *operands)

or, as spaces:

aarch64_expand_crc_using_pmul (scalar_mode crc_mode, scalar_mode data_mode,
                                rtx *operands)

This is probably just because of GCC's unfortunate insistence on using
tabs to indent by 8 characters, which feels very anachronistic these
days.  From a quick check, it looks like in all instances of the above,
the arguments are indented by N columns too many if the line is indented
by N tabs.  That sounds consistent with a tab width of 7 rather than 8,
but maybe it's something else.

I can fix the indentation before pushing though, if that's more
convenient.

Richard