Dennis Zhang <dennis.zh...@arm.com> writes:
> diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def
> index 332a0b6b1ea..39ebb776d1d 100644
> --- a/gcc/config/aarch64/aarch64-simd-builtins.def
> +++ b/gcc/config/aarch64/aarch64-simd-builtins.def
> @@ -719,6 +719,9 @@
>    VAR1 (QUADOP_LANE, bfmlalb_lane_q, 0, ALL, v4sf)
>    VAR1 (QUADOP_LANE, bfmlalt_lane_q, 0, ALL, v4sf)
>
> +  /* Implemented by aarch64_vget_halfv8bf.  */
> +  VAR1 (GETREG, vget_half, 0, ALL, v8bf)
This should be AUTO_FP, since it doesn't have any side-effects.
(As before, we should probably rename the flag, but that's separate
work.)

> +
>    /* Implemented by aarch64_simd_<sur>mmlav16qi.  */
>    VAR1 (TERNOP, simd_smmla, 0, NONE, v16qi)
>    VAR1 (TERNOPU, simd_ummla, 0, NONE, v16qi)
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 9f0e2bd1e6f..f62c52ca327 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -7159,6 +7159,19 @@
>    [(set_attr "type" "neon_dot<VDQSF:q>")]
> )
>
> +;; vget_low/high_bf16
> +(define_expand "aarch64_vget_halfv8bf"
> +  [(match_operand:V4BF 0 "register_operand")
> +   (match_operand:V8BF 1 "register_operand")
> +   (match_operand:SI 2 "aarch64_zero_or_1")]
> +  "TARGET_BF16_SIMD"
> +{
> +  int hbase = INTVAL (operands[2]);
> +  rtx sel = aarch64_gen_stepped_int_parallel (4, hbase * 4, 1);

I think this needs to be:

  aarch64_simd_vect_par_cnst_half

instead.  The issue is that on big-endian targets, GCC assumes vector
lane 0 is in the high part of the register, whereas for AArch64 it's
always in the low part of the register.  So we convert from AArch64
numbering to GCC numbering when generating the rtx and then take
endianness into account when matching the rtx later.

It would be good to have -mbig-endian tests that make sure we generate
the right instruction for each function (i.e. that we get them the
right way round).  I guess it would be good to test that for
little-endian too.

> +  emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], sel));
> +  DONE;
> +})
> +
> ;; bfmmla
> (define_insn "aarch64_bfmmlaqv4sf"
>   [(set (match_operand:V4SF 0 "register_operand" "=w")
> diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
> index 215fcec5955..0c8bc2b0c73 100644
> --- a/gcc/config/aarch64/predicates.md
> +++ b/gcc/config/aarch64/predicates.md
> @@ -84,6 +84,10 @@
>             (ior (match_test "op == constm1_rtx")
>                  (match_test "op == const1_rtx"))))))
>
> +(define_predicate "aarch64_zero_or_1"
> +  (and (match_code "const_int")
> +       (match_test "op == const0_rtx || op == const1_rtx")))

"zero_or_1" looked odd to me; it feels like it should be "0_or_1" or
"zero_or_one".  But I see that it's for consistency with
aarch64_reg_zero_or_m1_or_1, so let's keep it as-is.

Thanks,
Richard
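
P.S. For the expander change, this is roughly what I had in mind
(an untested sketch; everything other than the "sel" line is
unchanged from your patch):

  (define_expand "aarch64_vget_halfv8bf"
    [(match_operand:V4BF 0 "register_operand")
     (match_operand:V8BF 1 "register_operand")
     (match_operand:SI 2 "aarch64_zero_or_1")]
    "TARGET_BF16_SIMD"
  {
    /* Number the lanes in GCC's (endian-adjusted) terms rather than
       assuming architectural lane order; the code that matches the
       parallel later then takes BYTES_BIG_ENDIAN into account.  */
    rtx sel = aarch64_simd_vect_par_cnst_half (V8BFmode, 8,
					       INTVAL (operands[2]));
    emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], sel));
    DONE;
  })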
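
And this is the sort of big-endian test I was thinking of (also just a
sketch: the dg directives are the ones I'd expect a BF16 test to use,
and the scan pattern assumes vget_high_bf16 becomes a single DUP while
vget_low_bf16 needs no instruction at all, which I haven't verified):

  /* { dg-do compile } */
  /* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
  /* { dg-add-options arm_v8_2a_bf16_neon } */
  /* { dg-additional-options "-O2 -mbig-endian" } */

  #include <arm_neon.h>

  bfloat16x4_t
  low (bfloat16x8_t a)
  {
    /* Architecturally the low half, even though GCC numbers the
       lanes the other way round on big-endian.  */
    return vget_low_bf16 (a);
  }

  bfloat16x4_t
  high (bfloat16x8_t a)
  {
    return vget_high_bf16 (a);
  }

  /* Exactly one half-extract instruction, for "high" only.  */
  /* { dg-final { scan-assembler-times {\tdup\t} 1 } } */

A little-endian version would presumably be the same file without
-mbig-endian.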