Re: [PATCH 1/3] aarch64: Add support for fp8 convert and scale

Saurabh Jha Thu, 07 Nov 2024 06:23:46 -0800



On 11/7/2024 9:03 AM, Kyrylo Tkachov wrote:

Hi Saurabh,

On 6 Nov 2024, at 11:03, saurabh....@arm.com wrote:


The AArch64 FEAT_FP8 extension introduces instructions for conversion
and scaling.

This patch introduces the following intrinsics:
1. vcvt{1|2}_{bf16|high_bf16|low_bf16}_mf8_fpm.
2. vcvt{q}_mf8_f16_fpm.
3. vcvt_{high}_mf8_f32_fpm.
4. vscale{q}_{f16|f32|f64}.

We introduced three new aarch64_builtin_signatures enum variants:
1. binary_fpm.
2. ternary_fpm.
3. unary_fpm.

We added support for these variants for declaring types and for expanding to 
RTL.

We added new simd_types for integers (s32, s32q, and s64q) and for
fp8 (f8, and f8q).

Also changed the faminmax intrinsic instruction pattern so that it works
better with the new fscale pattern.

Because we added support for fp8 intrinsics here, we modified the check
in acle/fp8.c that was checking that __ARM_FEATURE_FP8 macro is not
defined.

gcc/ChangeLog:

* config/aarch64/aarch64-builtins.cc
(enum class): New variants to support new signatures.
(aarch64_fntype): Handle new signatures.
(aarch64_expand_pragma_builtin): Handle new signatures.
* config/aarch64/aarch64-c.cc
(aarch64_update_cpp_builtins): New flag for FP8.
* config/aarch64/aarch64-simd-pragma-builtins.def
(ENTRY_BINARY_FPM): Macro to declare unary fpm intrinsics.
(ENTRY_TERNARY_FPM): Macro to declare ternary fpm intrinsics.
(ENTRY_UNARY_FPM): Macro to declare unary fpm intrinsics.
(ENTRY_VHSDF_VHSDI): Macro to declare binary intrinsics.
* config/aarch64/aarch64-simd.md
(@aarch64_<faminmax_uns_op><mode>): Renamed.
(@aarch64_<faminmax_uns_op><VHSDF:mode><VHSDF:mode>): Renamed.
(@aarch64_<fpm_uns_name><V8HFBF:mode><VB:mode>): Unary fpm
pattern.
(@aarch64_<fpm_uns_name><V8HFBF:mode><V16QI_ONLY:mode>): Unary
fpm pattern.
(@aarch64_<fpm_uns_name><VB:mode><VCVTFPM:mode><VH_SF:mode>):
Binary fpm pattern.
(@aarch64_<fpm_uns_name><V16QI_ONLY:mode><V8QI_ONLY:mode><V4SF_ONLY:mode><V4SF_ONLY:mode>):
Ternary fpm pattern.
(@aarch64_<fpm_uns_op><VHSDF:mode><VHSDI:mode>): Scale fpm
pattern.
* config/aarch64/iterators.md: New attributes and iterators.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/acle/fp8.c: Remove check that fp8 feature
macro doesn't exist.
* gcc.target/aarch64/simd/scale_fpm.c: New test.
* gcc.target/aarch64/simd/vcvt_fpm.c: New test.

---

I could not find a way to compress declarations in
aarch64-simd-pragma-builtins.def for convert instructions as there was
no pattern apart from the repetion for vcvt1/vcvt2 types. Let me know
if those declrations can be expressed more concisely.

In the scale instructions, I am not doing any casting from float to int
modes in the second operand. Let me know if that's a problem.
---
gcc/config/aarch64/aarch64-builtins.cc        | 132 ++++++++++--
gcc/config/aarch64/aarch64-c.cc               |   2 +
.../aarch64/aarch64-simd-pragma-builtins.def  |  56 +++++
gcc/config/aarch64/aarch64-simd.md            |  72 ++++++-
gcc/config/aarch64/iterators.md               |  99 +++++++++
gcc/testsuite/gcc.target/aarch64/acle/fp8.c   |  10 -
.../gcc.target/aarch64/simd/scale_fpm.c       |  60 ++++++
.../gcc.target/aarch64/simd/vcvt_fpm.c        | 197 ++++++++++++++++++
8 files changed, 603 insertions(+), 25 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/scale_fpm.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/vcvt_fpm.c

<0001-aarch64-Add-support-for-fp8-convert-and-scale.patch>


diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index cfe95bd4c31..87bbfb0e586 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -9982,13 +9982,13 @@
  )

;; faminmax

-(define_insn "@aarch64_<faminmax_uns_op><mode>"
+(define_insn "@aarch64_<faminmax_uns_op><VHSDF:mode><VHSDF:mode>"
    [(set (match_operand:VHSDF 0 "register_operand" "=w")
        (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand" "w")
                       (match_operand:VHSDF 2 "register_operand" "w")]
                      FAMINMAX_UNS))]
    "TARGET_FAMINMAX"
-  "<faminmax_uns_op>\t%0.<Vtype>, %1.<Vtype>, %2.<Vtype>"
+  "<faminmax_uns_op>\t%0.<Vtype>, %1.<VHSDF:Vtype>, %2.<VHSDF:Vtype>"
  )

(define_insn "*aarch64_faminmax_fused"

@@ -9999,3 +9999,71 @@
    "TARGET_FAMINMAX"
    "<faminmax_op>\t%0.<Vtype>, %1.<Vtype>, %2.<Vtype>"
  )
+
+;; fpm unary instructions.
+(define_insn "@aarch64_<fpm_uns_name><V8HFBF:mode><VB:mode>"
+  [(set (match_operand:V8HFBF 0 "register_operand" "=w")
+       (unspec:V8HFBF
+        [(match_operand:VB 1 "register_operand" "w")
+         (reg:DI FPM_REGNUM)]
+       FPM_UNARY_UNS))]
+  "TARGET_FP8"
+  "<fpm_uns_op>\t%0.<V8HFBF:Vtype>, %1.<VB:Vtype>"
+)
+
+;; fpm unary instructions, where the input is lowered from V16QI to
+;; V8QI.
+(define_insn "@aarch64_<fpm_uns_name><V8HFBF:mode><V16QI_ONLY:mode>"
+  [(set (match_operand:V8HFBF 0 "register_operand" "=w")
+       (unspec:V8HFBF
+        [(match_operand:V16QI_ONLY 1 "register_operand" "w")
+         (reg:DI FPM_REGNUM)]
+       FPM_UNARY_LOW_UNS))]
+  "TARGET_FP8"
+  {
+    operands[1] = force_lowpart_subreg (V8QImode,
+                                       operands[1],
+                                       recog_data.operand[1]->mode);

I don’t think this is needed? This code is only executed in the final assembly 
output stage and you already explicitly print operand 1 with a “.8b” suffix so 
changing the mode here doesn’t matter.

Should we rather do this and remove the hardcoded ".8b" or the other wayaround? Is there some convention around this that we can follow here?


+    return "<fpm_uns_op>\t%0.<V8HFBF:Vtype>, %1.8b";
+  }
+)

+;; fpm ternary instructions.
+(define_insn
+  
"@aarch64_<fpm_uns_name><V16QI_ONLY:mode><V8QI_ONLY:mode><V4SF_ONLY:mode><V4SF_ONLY:mode>"
+  [(set (match_operand:V16QI_ONLY 0 "register_operand" "=w")
+       (unspec:V16QI_ONLY
+        [(match_operand:V8QI_ONLY 1 "register_operand" "w")
+         (match_operand:V4SF_ONLY 2 "register_operand" "w")
+         (match_operand:V4SF_ONLY 3 "register_operand" "w")
+         (reg:DI FPM_REGNUM)]
+       FPM_TERNARY_VCVT_UNS))]
+  "TARGET_FP8"
+  {
+    operands[1] = force_reg (V16QImode, operands[1]);
+    return "<fpm_uns_op>\t%1.16b, %2.<V4SF_ONLY:Vtype>, %3.<V4SF_ONLY:Vtype>";
+  }
+)

Same here. But more worryingly the destination operand 0 is not being printed 
out anywhere here. Was there supposed to be a tie of one of the input operands 
to operand 0 in this pattern?
I haven’t looked deeply into what exactly these instructions do, but please 
double check the operands here.

These fma intrinsics and the vdot intrinsics in the previous commit seemto write to their first operand (operand 1, vd) in the acle spec:

https://github.com/ARM-software/acle/blob/main/neon_intrinsics/advsimd.md#multiply-6

Is there another way we should be printing the assembly for these cases?

Many thanks,
Saurabh

Thanks,
Kyrill

Re: [PATCH 1/3] aarch64: Add support for fp8 convert and scale

Reply via email to