Mariam Arutunian <mariamarutun...@gmail.com> writes:
> On Sat, Jun 8, 2024 at 3:41 PM Richard Sandiford <richard.sandif...@arm.com>
> wrote:
>
>> Mariam Arutunian <mariamarutun...@gmail.com> writes:
>> > This patch introduces two new expanders for the aarch64 backend,
>> > dedicated to generating optimized code for CRC computations.
>> > The new expanders are designed to leverage specific hardware capabilities
>> > to achieve faster CRC calculations,
>> > particularly using the pmul or crc32 instructions when supported by the
>> > target architecture.
>>
>> Thanks for porting this to aarch64!
>>
>> > Expander 1: Bit-Forward CRC (crc<ALLI:mode><ALLX:mode>4)
>> > For targets that support the pmul instruction (TARGET_AES),
>> > the expander will generate code that uses the pmul (crypto_pmulldi)
>> > instruction for CRC computation.
>> >
>> > Expander 2: Bit-Reversed CRC (crc_rev<ALLI:mode><ALLX:mode>4)
>> > The expander first checks if the target supports the CRC32 instruction set
>> > (TARGET_CRC32)
>> > and the polynomial in use is 0x1EDC6F41 (iSCSI).  If the conditions are met,
>> > it emits calls to the corresponding crc32 instruction (crc32b, crc32h,
>> > crc32w, or crc32x, depending on the data size).
>> > If the target does not support crc32 but supports pmul, it then uses the
>> > pmul (crypto_pmulldi) instruction for bit-reversed CRC computation.
>> >
>> > Otherwise, table-based CRC is generated.
>> >
>> > gcc/config/aarch64/
>> >
>> > 	* aarch64-protos.h (aarch64_expand_crc_using_clmul): New extern
>> > 	function declaration.
>> > 	(aarch64_expand_reversed_crc_using_clmul): Likewise.
>> > 	* aarch64.cc (aarch64_expand_crc_using_clmul): New function.
>> > 	(aarch64_expand_reversed_crc_using_clmul): Likewise.
>> > 	* aarch64.md (UNSPEC_CRC, UNSPEC_CRC_REV): New unspecs.
>> > 	(crc_rev<ALLI:mode><ALLX:mode>4): New expander for bit-reversed CRC.
>> > 	(crc<ALLI:mode><ALLX:mode>4): New expander for bit-forward CRC.
>> > 	* iterators.md (crc_data_type): New mode attribute.
>> >
>> > gcc/testsuite/gcc.target/aarch64/
>> >
>> > 	* crc-1-pmul.c: New test.
>> > 	* crc-10-pmul.c: Likewise.
>> > 	* crc-12-pmul.c: Likewise.
>> > 	* crc-13-pmul.c: Likewise.
>> > 	* crc-14-pmul.c: Likewise.
>> > 	* crc-17-pmul.c: Likewise.
>> > 	* crc-18-pmul.c: Likewise.
>> > 	* crc-21-pmul.c: Likewise.
>> > 	* crc-22-pmul.c: Likewise.
>> > 	* crc-23-pmul.c: Likewise.
>> > 	* crc-4-pmul.c: Likewise.
>> > 	* crc-5-pmul.c: Likewise.
>> > 	* crc-6-pmul.c: Likewise.
>> > 	* crc-7-pmul.c: Likewise.
>> > 	* crc-8-pmul.c: Likewise.
>> > 	* crc-9-pmul.c: Likewise.
>> > 	* crc-CCIT-data16-pmul.c: Likewise.
>> > 	* crc-CCIT-data8-pmul.c: Likewise.
>> > 	* crc-coremark-16bitdata-pmul.c: Likewise.
>> > 	* crc-crc32-data16.c: Likewise.
>> > 	* crc-crc32-data32.c: Likewise.
>> > 	* crc-crc32-data8.c: Likewise.
>> >
>> > Signed-off-by: Mariam Arutunian <mariamarutun...@gmail.com>
>>
>> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
>> > index 1d3f94c813e..167e1140f0d 100644
>> > --- a/gcc/config/aarch64/aarch64-protos.h
>> > +++ b/gcc/config/aarch64/aarch64-protos.h
>> > @@ -1117,5 +1117,8 @@ extern void mingw_pe_encode_section_info (tree, rtx, int);
>> >
>> >  bool aarch64_optimize_mode_switching (aarch64_mode_entity);
>> >  void aarch64_restore_za (rtx);
>> > +void aarch64_expand_crc_using_clmul (rtx *);
>> > +void aarch64_expand_reversed_crc_using_clmul (rtx *);
>> > +
>> >
>> >  #endif /* GCC_AARCH64_PROTOS_H */
>> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> > index ee12d8897a8..05cd0296d38 100644
>> > --- a/gcc/config/aarch64/aarch64.cc
>> > +++ b/gcc/config/aarch64/aarch64.cc
>> > @@ -30265,6 +30265,135 @@ aarch64_retrieve_sysreg (const char *regname, bool write_p, bool is128op)
>> >    return sysreg->encoding;
>> >  }
>> >
>> > +/* Generate assembly to calculate CRC
>> > +   using carry-less multiplication instruction.
>> > +   OPERANDS[1] is input CRC,
>> > +   OPERANDS[2] is data (message),
>> > +   OPERANDS[3] is the polynomial without the leading 1.  */
>> > +
>> > +void
>> > +aarch64_expand_crc_using_clmul (rtx *operands)
>>
>> This should probably be pmul rather than clmul.
>>
>> > +{
>> > +  /* Check and keep arguments.  */
>> > +  gcc_assert (!CONST_INT_P (operands[0]));
>> > +  gcc_assert (CONST_INT_P (operands[3]));
>> > +  rtx crc = operands[1];
>> > +  rtx data = operands[2];
>> > +  rtx polynomial = operands[3];
>> > +
>> > +  unsigned HOST_WIDE_INT
>> > +    crc_size = GET_MODE_BITSIZE (GET_MODE (operands[0])).to_constant ();
>> > +  gcc_assert (crc_size <= 32);
>> > +  unsigned HOST_WIDE_INT
>> > +    data_size = GET_MODE_BITSIZE (GET_MODE (data)).to_constant ();
>>
>> We could instead make the interface:
>>
>>   void
>>   aarch64_expand_crc_using_pmul (scalar_mode crc_mode, scalar_mode data_mode,
>>                                  rtx *operands)
>>
>> so that the lines above don't need the to_constant.  This should "just
>> work" on the .md file side, since the modes being passed are naturally
>> scalar_mode.
>>
>> I think it'd be worth asserting also that data_size <= crc_size.
>> (Although we could handle any MAX (data_size, crc_size) <= 32
>> with some adjustment.)
>>
>> > +
>> > +  /* Calculate the quotient.  */
>> > +  unsigned HOST_WIDE_INT
>> > +    q = gf2n_poly_long_div_quotient (UINTVAL (polynomial), crc_size + 1);
>> > +
>> > +  /* CRC calculation's main part.
>> > +     */
>> > +  if (crc_size > data_size)
>> > +    crc = expand_shift (RSHIFT_EXPR, DImode, crc, crc_size - data_size,
>> > +			NULL_RTX, 1);
>> > +
>> > +  rtx t0 = gen_reg_rtx (DImode);
>> > +  aarch64_emit_move (t0, gen_int_mode (q, DImode));
>>
>> It's only a minor simplification, but this could instead be:
>>
>>   rtx t0 = force_reg (DImode, gen_int_mode (q, DImode));
>>
>> > +  rtx t1 = gen_reg_rtx (DImode);
>> > +  aarch64_emit_move (t1, polynomial);
>>
>> If polynomial is a constant operand of mode crc_mode, GCC's standard
>> CONST_INT representation is to sign-extend the constant to 64 bits.
>> E.g. a QImode value of 0b1000_0000 would be represented as -128.
>>
>> I think here we want the zero-extended form, so it might be safer to do:
>>
>>   polynomial = simplify_gen_unary (ZERO_EXTEND, DImode, polynomial,
>>                                    crc_mode);
>>   rtx t1 = force_reg (DImode, polynomial);
>>
>> > +
>> > +  rtx a0 = expand_binop (DImode, xor_optab, crc, data, NULL_RTX, 1,
>> > +			 OPTAB_WIDEN);
>> > +
>> > +  rtx clmul_res = gen_reg_rtx (TImode);
>> > +  emit_insn (gen_aarch64_crypto_pmulldi (clmul_res, a0, t0));
>> > +  a0 = gen_lowpart (DImode, clmul_res);
>> > +
>> > +  a0 = expand_shift (RSHIFT_EXPR, DImode, a0, crc_size, NULL_RTX, 1);
>> > +
>> > +  emit_insn (gen_aarch64_crypto_pmulldi (clmul_res, a0, t1));
>> > +  a0 = gen_lowpart (DImode, clmul_res);
>> > +
>> > +  if (crc_size > data_size)
>> > +    {
>> > +      rtx crc_part = expand_shift (LSHIFT_EXPR, DImode, operands[1], data_size,
>> > +				   NULL_RTX, 0);
>> > +      a0 =  expand_binop (DImode, xor_optab, a0, crc_part, NULL_RTX, 1,
>> > +			    OPTAB_DIRECT);
>>
>> Formatting nit: extra space after "a0 = "
>>
>> > +    }
>> > +  /* Zero upper bits beyond crc_size.
>> > +     */
>> > +  rtx num_shift = gen_int_mode (64 - crc_size, DImode);
>> > +  a0 = expand_shift (LSHIFT_EXPR, DImode, a0, 64 - crc_size, NULL_RTX, 0);
>> > +  a0 = expand_shift (RSHIFT_EXPR, DImode, a0, 64 - crc_size, NULL_RTX, 1);
>>
>> Rather than shift left and then right, I think we should just AND:
>>
>>   rtx mask = gen_int_mode (GET_MODE_MASK (crc_mode), DImode);
>>   a0 = expand_binop (DImode, and_optab, a0, mask, NULL_RTX, 1,
>>                      OPTAB_DIRECT);
>>
>> That said, it looks like operands[0] has crc_mode.  The register bits
>> above crc_size therefore shouldn't matter, since they're undefined on
>> read.  E.g. even though (reg:SI R) is stored in an X register, only the
>> low 32 bits are defined; the upper 32 bits can be any value.
>>
>> So I'd expect we could replace this and...
>>
>> > +
>> > +  rtx tgt = simplify_gen_subreg (DImode, operands[0],
>> > +				 GET_MODE (operands[0]), 0);
>> > +  aarch64_emit_move (tgt, a0);
>>
>> ...this with just:
>>
>>   aarch64_emit_move (operands[0], gen_lowpart (crc_mode, a0));
>>
>> Perhaps that would break down if operands[0] is a subreg with
>> SUBREG_PROMOTED_VAR_P set, but I think it's up to target-independent
>> code to handle that case.
>
> Thank you once again for your review.
> I made all the recommended changes except for this and sent the new
> version.
> On some tests the CRC function was returning the whole computed value,
> without zeroing the upper part.
Thanks for the update, and sorry for the (very) slow review.

I tried this locally, and it looks like the need to zero is coming from
the SUBREG_PROMOTED_VAR_P thing after all.  In expand_crc_optab_fn:

  rtx dest = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);

is returning a subreg like:

  (subreg/s/v:SI (reg:DI R) 0)

where /s == SUBREG_PROMOTED_VAR_P and /v == SUBREG_PROMOTED_UNSIGNED_P.
This means that R is known to be a zero extension of a 32-bit value on
read and must be set to such a value on write.

In the patch, operands[0] is the subreg above, and so things break if we
don't set R to a zero-extended value.  But optabs aren't generally
required to handle this kind of subreg as a destination.  It's usually
up to target-independent code to deal with them instead.

expand_fn_using_insn handles this kind of situation by creating a
temporary SI destination, then copying a zero-extension of the temporary
back to R.  I've pushed 5686d3b8ae16d9aeea8d39a56ec6f8ecee661e01 to
split this code out into helper routines so that other functions in
internal-fn.cc can use them.  Applying:

@@ -3984,15 +3997,14 @@ expand_crc_optab_fn (internal_fn fn, gcall *stmt, convert_optab optab)
                                     (fn, tree_pair (data_type, result_type),
                                      OPTIMIZE_FOR_SPEED))
     {
       class expand_operand ops[4];
-      create_output_operand (&ops[0], dest, TYPE_MODE (result_type));
+      create_call_lhs_operand (&ops[0], dest, TYPE_MODE (result_type));
       create_input_operand (&ops[1], crc, TYPE_MODE (result_type));
       create_input_operand (&ops[2], data, TYPE_MODE (data_type));
       create_input_operand (&ops[3], polynomial, TYPE_MODE (result_type));
       insn_code icode = convert_optab_handler (optab, TYPE_MODE (data_type),
                                                TYPE_MODE (result_type));
       expand_insn (icode, 4, ops);
-      if (!rtx_equal_p (dest, ops[0].value))
-        emit_move_insn (dest, ops[0].value);
+      assign_call_lhs (lhs, dest, &ops[0]);
     }
   else
     {

on top of that seems to avoid the need for the zero extension.

> I'm also unsure about the indentation.
> Last time, I thought I had set it
> correctly, but it turned out to be incorrect.  This time, I've adjusted
> the settings, but I'm still not sure if it's correct.

Thanks.  The indentation mostly looks good.  I think the main remaining
thing is that arguments tend to be indented too far.  E.g.:

aarch64_expand_crc_using_pmul (scalar_mode crc_mode, scalar_mode data_mode,
				rtx *operands)

or, as spaces:

aarch64_expand_crc_using_pmul (scalar_mode crc_mode, scalar_mode data_mode,
                                rtx *operands)

This is probably just because of GCC's unfortunate insistence on using
tabs to indent by 8 characters, which feels very anachronistic these
days.  From a quick check, it looks like in all instances of the above,
the arguments are indented by N columns too many if the line is indented
by N tabs.  That sounds consistent with a tab width of 7 rather than 8,
but maybe it's something else.

I can fix the indentation before pushing though, if that's more
convenient.

Richard