Soumya AR <soum...@nvidia.com> writes:
> From 7fafcb5e0174c56205ec05406c9a412196ae93d3 Mon Sep 17 00:00:00 2001
> From: Soumya AR <soum...@nvidia.com>
> Date: Thu, 3 Oct 2024 11:53:07 +0530
> Subject: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction
>
> This patch uses the FSCALE instruction provided by SVE to implement the
> standard ldexp family of functions.
>
> Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the
> following code:
>
> float
> test_ldexpf (float x, int i)
> {
>       return __builtin_ldexpf (x, i);
> }
>
> double
> test_ldexp (double x, int i)
> {
>       return __builtin_ldexp(x, i);
> }
>
> GCC Output:
>
> test_ldexpf:
>       b ldexpf
>
> test_ldexp:
>       b ldexp
>
> Since SVE provides an FSCALE instruction, we can use it to process
> scalar floats by moving them into vector registers and performing an
> fscale, similar to how LLVM handles the ldexp builtins.
>
> New Output:
>
> test_ldexpf:
>       fmov s31, w0
>       ptrue p7.b, all
>       fscale z0.s, p7/m, z0.s, z31.s
>       ret
>
> test_ldexp:
>       sxtw x0, w0
>       ptrue p7.b, all
>       fmov d31, x0
>       fscale z0.d, p7/m, z0.d, z31.d
>       ret
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?

Could we also use the .H form for __builtin_ldexpf16?
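Something along the lines of the existing tests, mirroring the .S and .D
cases (a sketch only -- I haven't checked that the builtin goes through
the same expansion path):

```
_Float16
test_ldexpf16 (_Float16 x, int i)
{
  return __builtin_ldexpf16 (x, i);
}
/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
```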

I suppose:

> @@ -2286,7 +2289,8 @@
>                        (VNx8DI "VNx2BI") (VNx8DF "VNx2BI")
>                        (V8QI "VNx8BI") (V16QI "VNx16BI")
>                        (V4HI "VNx4BI") (V8HI "VNx8BI") (V2SI "VNx2BI")
> -                      (V4SI "VNx4BI") (V2DI "VNx2BI")])
> +                      (V4SI "VNx4BI") (V2DI "VNx2BI")
> +                      (SF "VNx4BI") (DF "VNx2BI")])

...this again raises the question of what we should do for predicate
modes when the data mode isn't a natural SVE mode.  That came up
recently in relation to V1DI in the popcount patch, and for reductions
in the ANDV etc. patch.

Three obvious options are:

(1) Use the nearest SVE mode with a full ptrue (as the patch does).
(2) Use the nearest SVE mode with a 128-bit ptrue.
(3) Add new modes V16BI, V8BI, V4BI, V2BI, and V1BI.  (And possibly BI
    for scalars.)
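For (3), the new modes would presumably be declared in aarch64-modes.def
alongside the existing predicate modes (a sketch only; the names and
byte sizes below are assumptions modelled on the VNx16BI definition, not
a worked-out patch):

```
/* Hypothetical fixed-length predicate modes for option (3), one
   predicate bit per byte of data, as for the VNx*BI modes.  */
VECTOR_BOOL_MODE (V16BI, 16, BI, 2);
VECTOR_BOOL_MODE (V8BI, 8, BI, 1);
VECTOR_BOOL_MODE (V4BI, 4, BI, 1);
VECTOR_BOOL_MODE (V2BI, 2, BI, 1);
VECTOR_BOOL_MODE (V1BI, 1, BI, 1);
```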

The problem with (1) is that, as Tamar pointed out, it doesn't work
properly with reductions.  It also isn't safe for this patch (without
fast-mathy options) because of FP exceptions.  Although writing to
a scalar FP register zeros the upper bits, and so gives a "safe" value
for this particular operation, nothing guarantees that all SF and DF
values have this zero-extended form.  They could come from subregs of
Advanced SIMD or SVE vectors.  The ABI also doesn't guarantee that
incoming SF and DF values are zero-extended.

(2) would be safe, but would mean that we continue to have an nunits
disagreement between the data mode and the predicate mode.  This would
prevent operations being described in generic RTL in future.

(3) is probably the cleanest representational approach, but has the problem
that we cannot store a fixed-length portion of an SVE predicate.
We would have to load and store the modes via other register classes.
(With PMOV we could use scalar FP loads and stores, but otherwise
we'd probably need secondary memory reloads.)  That said, we could
tell the RA to spill in a full predicate mode, so this shouldn't be
a problem unless the modes somehow get exposed to gimple or frontends.

WDYT?

Richard

>  ;; ...and again in lower case.
>  (define_mode_attr vpred [(VNx16QI "vnx16bi") (VNx8QI "vnx8bi")
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/fscale.c b/gcc/testsuite/gcc.target/aarch64/sve/fscale.c
> new file mode 100644
> index 00000000000..251b4ef9188
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/fscale.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-Ofast" } */
> +
> +float
> +test_ldexpf (float x, int i)
> +{
> +  return __builtin_ldexpf (x, i);
> +}
> +/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
> +
> +double
> +test_ldexp (double x, int i)
> +{
> +  return __builtin_ldexp (x, i);
> +}
> +/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
