Re: [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Complex Addition, Multiply and FMA.

Richard Sandiford via Gcc-patches Mon, 14 Dec 2020 05:30:39 -0800

Rearranging slightly…

> @@ -708,6 +713,10 @@ (define_c_enum "unspec"
>      UNSPEC_FCMLA90   ; Used in aarch64-simd.md.
>      UNSPEC_FCMLA180  ; Used in aarch64-simd.md.
>      UNSPEC_FCMLA270  ; Used in aarch64-simd.md.
> +    UNSPEC_FCMUL     ; Used in aarch64-simd.md.
> +    UNSPEC_FCMUL180  ; Used in aarch64-simd.md.
> +    UNSPEC_FCMLS     ; Used in aarch64-simd.md.
> +    UNSPEC_FCMLS180  ; Used in aarch64-simd.md.
>      UNSPEC_ASRD              ; Used in aarch64-sve.md.
>      UNSPEC_ADCLB     ; Used in aarch64-sve2.md.
>      UNSPEC_ADCLT     ; Used in aarch64-sve2.md.
> @@ -726,6 +735,10 @@ (define_c_enum "unspec"
>      UNSPEC_CMLA180   ; Used in aarch64-sve2.md.
>      UNSPEC_CMLA270   ; Used in aarch64-sve2.md.
>      UNSPEC_CMLA90    ; Used in aarch64-sve2.md.
> +    UNSPEC_CMLS              ; Used in aarch64-sve2.md.
> +    UNSPEC_CMLS180   ; Used in aarch64-sve2.md.
> +    UNSPEC_CMUL              ; Used in aarch64-sve2.md.
> +    UNSPEC_CMUL180   ; Used in aarch64-sve2.md.
>      UNSPEC_COND_FCVTLT       ; Used in aarch64-sve2.md.
>      UNSPEC_COND_FCVTNT       ; Used in aarch64-sve2.md.
>      UNSPEC_COND_FCVTX        ; Used in aarch64-sve2.md.
> […]
> @@ -3418,7 +3457,85 @@ (define_int_attr rot [(UNSPEC_CADD90 "90")
>                     (UNSPEC_COND_FCMLA "0")
>                     (UNSPEC_COND_FCMLA90 "90")
>                     (UNSPEC_COND_FCMLA180 "180")
> -                   (UNSPEC_COND_FCMLA270 "270")])
> +                   (UNSPEC_COND_FCMLA270 "270")
> +                   (UNSPEC_FCMUL "0")
> +                   (UNSPEC_FCMUL180 "180")])
> +
> +;; A conjucate is a negation of the imaginary component
> +;; The number in the inspecs are the rotation component of the instruction, 
> e.g


unspecs

> +;; FCMLS180 means use the instruction with #180.
> +;; The iterator is used to produce the right name mangling for the function.
> +;;
> +;; The rotation value does not directly correlate to a rotation along the 
> argant
> +;; plane as the instructions only perform half the computation.
> +;;
> +;; For the implementation we threat any rotation by 0 as normal and 180 as
> +;; conjucate.  This is only for implementing the vectorizer patterns.
> +(define_int_attr rot_op [(UNSPEC_FCMLS "")
> +                      (UNSPEC_FCMLS180 "_conj")
> +                      (UNSPEC_FCMLA "")
> +                      (UNSPEC_FCMLA180 "_conj")
> +                      (UNSPEC_FCMUL "")
> +                      (UNSPEC_FCMUL180 "_conj")
> +                      (UNSPEC_CMLS "")
> +                      (UNSPEC_CMLA "")
> +                      (UNSPEC_CMLA180 "_conj")
> +                      (UNSPEC_CMUL "")
> +                      (UNSPEC_CMUL180 "_conj")])
> +

Realise this is being awkward, sorry, but: I think it would be clearer
to use unspecs with _CONJ in the name for the conjugated forms, rather
than explain away how 180 is being used.

All four rotations make sense as a pair (i.e. as a full complex
multiplication), rather than just as a standalone instruction:

  FMLA    = a + b * c
  FMLA90  = a + b * c * i
  FMLA180 = a - b * c
  FMLA270 = a - b * c * i

I.e. FMLA180 == FMLS and FMLA270 == “FMLS90” (a made-up term).
“FMLS180” is equivalent to FMLA.

We can then additionally conjugate “b” in each of the four forms above.

At the moment, FMLA90 and FMLA270/FMLS90 (and their conjugate forms)
aren't exposed as optabs, but that could change in future.

So I think we should avoid introducing FMLS and FMLS180 and instead use
(name in patch -> suggested replacement):

  FMLA -> FMLA
  FMLS -> FMLA180
  FMLA180 -> FMLA_CONJ
  FMLS180 -> FMLA180_CONJ

where _CONJ has the effect of adding 180 degrees to the second
rotation amount.

Then:

> +;; The complex operations when performed on a real complex number require two
> +;; instructions to perform the operation. e.g. complex multiplication 
> requires
> +;; two FCMUL with a particular rotation value.
> +;;
> +;; These values can be looked up in rotsplit1 and rotsplit2.  as an example
> +;; FCMUL needs the first instruction to use #0 and the second #90.
> +(define_int_attr rotsplit1 [(UNSPEC_FCMLA "0")
> +                         (UNSPEC_FCMLA180 "0")
> +                         (UNSPEC_FCMUL "0")
> +                         (UNSPEC_FCMUL180 "0")
> +                         (UNSPEC_FCMLS "270")
> +                         (UNSPEC_FCMLS180 "90")])
> +
> +(define_int_attr rotsplit2 [(UNSPEC_FCMLA "90")
> +                         (UNSPEC_FCMLA180 "270")
> +                         (UNSPEC_FCMUL "90")
> +                         (UNSPEC_FCMUL180 "270")
> +                         (UNSPEC_FCMLS "180")
> +                         (UNSPEC_FCMLS180 "180")])

would become something like:

(define_int_attr rotsplit1 [(UNSPEC_FCMLA "0")
                            (UNSPEC_FCMLA_CONJ "0")
                            (UNSPEC_FCMLA180 "180")
                            (UNSPEC_FCMLA180_CONJ "180")
                            (UNSPEC_FCMUL "0")
                            (UNSPEC_FCMUL_CONJ "0")])

(define_int_attr rotsplit2 [(UNSPEC_FCMLA "90")
                            (UNSPEC_FCMLA_CONJ "270")
                            (UNSPEC_FCMLA180 "270")
                            (UNSPEC_FCMLA180_CONJ "90")
                            (UNSPEC_FCMUL "90")
                            (UNSPEC_FCMUL_CONJ "270")])

IMO the mapping between unspec and rotation is then more obvious:
adding _CONJ adds 180 to rotsplt2, but doesn't change rotsplit1.

(The patch uses 270 as the first rotation for FMLS and 180 as the second
rotation, instead of the other way around.  That would be OK too if it's
better for some reason, but I think it would be worth a comment if so.)

> […]
> +;; unpredicated optab pattern for auto-vectorizer
> +;; The complex mla/mls operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cml<fcmac1><rot_op><mode>4"
> +  [(set (match_operand:SVE_FULL_F 0 "register_operand")
> +     (unspec:SVE_FULL_F
> +       [(match_dup 4)
> +        (match_dup 5)
> +        (match_operand:SVE_FULL_F 1 "register_operand")
> +        (match_operand:SVE_FULL_F 2 "register_operand")
> +        (match_operand:SVE_FULL_F 3 "register_operand")]
> +       FCMLA_OP))]
> +  "TARGET_SVE && !BYTES_BIG_ENDIAN"
> +{
> +  operands[4] = aarch64_ptrue_reg (<VPRED>mode);
> +  operands[5] = gen_int_mode (SVE_RELAXED_GP, SImode);
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  emit_insn (
> +    gen_aarch64_pred_fcmla<sve_rot1><mode> (tmp, operands[4],
> +                                         operands[1], operands[2],
> +                                         operands[3], operands[5]));
> +  emit_insn (
> +    gen_aarch64_pred_fcmla<sve_rot2><mode> (operands[0], operands[4],
> +                                         tmp, operands[2],
> +                                         operands[3], operands[5]));

Very minor, sorry, but: should be no “(”s at the end of a line:

  emit_insn
    (gen_aarch64_pred_fcmla<sve_rot2><mode> (operands[0], operands[4],
                                             tmp, operands[2],
                                             operands[3], operands[5]));

Same for the rest of the patch.

Are you sure the operands above are the right way round though?
The pattern is:

;; Predicated FCMLA.
(define_insn "@aarch64_pred_<optab><mode>"
  [(set (match_operand:SVE_FULL_F 0 "register_operand" "=w, ?&w")
        (unspec:SVE_FULL_F
          [(match_operand:<VPRED> 1 "register_operand" "Upl, Upl")
           (match_operand:SI 5 "aarch64_sve_gp_strictness")
           (match_operand:SVE_FULL_F 2 "register_operand" "w, w")
           (match_operand:SVE_FULL_F 3 "register_operand" "w, w")
           (match_operand:SVE_FULL_F 4 "register_operand" "0, w")]
          SVE_COND_FCMLA))]
  "TARGET_SVE"
  "@
   fcmla\t%0.<Vetype>, %1/m, %2.<Vetype>, %3.<Vetype>, #<rot>
   movprfx\t%0, %4\;fcmla\t%0.<Vetype>, %1/m, %2.<Vetype>, %3.<Vetype>, #<rot>"
  [(set_attr "movprfx" "*,yes")]
)

where operand 4 is the accumulator.  The code above seems to be passing
the accumulator as operand 2 instead.

I agree this is confusing.  The current SVE_COND_FCMLA order is the same
as the (fma …) rtx code (and thus the same as the FMLA patterns), which
seemed appropriate given that it really is a fused operation.

> +  DONE;
> +})
> +
> +;; unpredicated optab pattern for auto-vectorizer
> +;; The complex mul operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cmul<rot_op><mode>3"
> +  [(set (match_operand:SVE_FULL_F 0 "register_operand")
> +     (unspec:SVE_FULL_F
> +       [(match_dup 3)
> +        (match_dup 4)
> +        (match_operand:SVE_FULL_F 1 "register_operand")
> +        (match_operand:SVE_FULL_F 2 "register_operand")
> +        (match_dup 5)]
> +       FCMUL_OP))]
> +  "TARGET_SVE && !BYTES_BIG_ENDIAN"
> +{
> +  operands[3] = aarch64_ptrue_reg (<VPRED>mode);
> +  operands[4] = gen_int_mode (SVE_RELAXED_GP, SImode);
> +  operands[5] = force_reg (<MODE>mode, CONST0_RTX (<MODE>mode));

I can see that it makes sense to use (match_dup 4) and (match_dup 5)
in the FCMLA_OP expander above, so that the unspec has the same number
of operands as it does in “real” FCMLA patterns.  It doesn't matter in
practice, since the unspec is never generated, but I agree it's less
confusing.

But in this case, FCMUL_OP is a purely synthetic thing, so I think it
would be better to use local variables instead, like for the
Advanced SIMD patterns:

  rtx ptrue = aarch64_ptrue_reg (<VPRED>mode);
  rtx ptrue_hint = gen_int_mode (SVE_RELAXED_GP, SImode);
  rtx zero = force_reg (<MODE>mode, CONST0_RTX (<MODE>mode));

> […]
> +;; unpredicated optab pattern for auto-vectorizer
> +;; The complex mla/mls operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cml<fcmac1><rot_op><mode>4"
> +  [(set (match_operand:SVE_FULL_I 0 "register_operand")
> +     (plus:SVE_FULL_I (match_operand:SVE_FULL_I 1 "register_operand")
> +       (unspec:SVE_FULL_I
> +         [(match_operand:SVE_FULL_I 2 "register_operand")
> +          (match_operand:SVE_FULL_I 3 "register_operand")]
> +         SVE2_INT_CMLA_OP)))]

The SVE2_INT_CMLA_OP unspecs take three operands, with the accumulator
first this time (since it's not an fma).  So I think this should simply be:

(define_expand "cml<fcmac1><rot_op><mode>4"
  [(set (match_operand:SVE_FULL_I 0 "register_operand")
        (unspec:SVE_FULL_I
          [(match_operand:SVE_FULL_I 1 "register_operand")
           (match_operand:SVE_FULL_I 2 "register_operand")
           (match_operand:SVE_FULL_I 3 "register_operand")]
          SVE2_INT_CMLA_OP))]

> +  "TARGET_SVE2 && !BYTES_BIG_ENDIAN"
> +{
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  emit_insn (gen_aarch64_sve_cmla<sve_rot1><mode> (tmp, operands[1],
> +                                                operands[2], operands[3]));
> +  emit_insn (gen_aarch64_sve_cmla<sve_rot2><mode> (operands[0], tmp,
> +                                                operands[2], operands[3]));
> +  DONE;
> +})
> +
> +;; unpredicated optab pattern for auto-vectorizer
> +;; The complex mul operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cmul<rot_op><mode>3"
> +  [(set (match_operand:SVE_FULL_I 0 "register_operand")
> +     (unspec:SVE_FULL_I
> +       [(match_operand:SVE_FULL_I 1 "register_operand")
> +        (match_operand:SVE_FULL_I 2 "register_operand")
> +        (match_dup 3)]
> +       SVE2_INT_CMUL_OP))]
> +  "TARGET_SVE2 && !BYTES_BIG_ENDIAN"
> +{
> +  operands[3] = force_reg (<MODE>mode, CONST0_RTX (<MODE>mode));

Here too I think it would be better to use a temporary variable instead
of a (match_dup …) for the “fake” CMUL operations.

> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  emit_insn (gen_aarch64_sve_cmla<sve_rot1><mode> (tmp, operands[3],
> +                                                operands[1], operands[2]));
> +  emit_insn (gen_aarch64_sve_cmla<sve_rot2><mode> (operands[0], tmp,
> +                                                operands[1], operands[2]));
> +  DONE;
> +})

Thanks,
Richard

Re: [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Complex Addition, Multiply and FMA.

Reply via email to