Soumya AR <soum...@nvidia.com> writes:
> From 7fafcb5e0174c56205ec05406c9a412196ae93d3 Mon Sep 17 00:00:00 2001
> From: Soumya AR <soum...@nvidia.com>
> Date: Thu, 3 Oct 2024 11:53:07 +0530
> Subject: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction
>
> This patch uses the FSCALE instruction provided by SVE to implement the
> standard ldexp family of functions.
>
> Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the
> following code:
>
> float
> test_ldexpf (float x, int i)
> {
>   return __builtin_ldexpf (x, i);
> }
>
> double
> test_ldexp (double x, int i)
> {
>   return __builtin_ldexp (x, i);
> }
>
> GCC Output:
>
> test_ldexpf:
>         b       ldexpf
>
> test_ldexp:
>         b       ldexp
>
> Since SVE has support for an FSCALE instruction, we can use this to process
> scalar floats by moving them to a vector register and performing an fscale
> call, similar to how LLVM tackles an ldexp builtin as well.
>
> New Output:
>
> test_ldexpf:
>         fmov    s31, w0
>         ptrue   p7.b, all
>         fscale  z0.s, p7/m, z0.s, z31.s
>         ret
>
> test_ldexp:
>         sxtw    x0, w0
>         ptrue   p7.b, all
>         fmov    d31, x0
>         fscale  z0.d, p7/m, z0.d, z31.d
>         ret
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
Could we also use the .H form for __builtin_ldexpf16?  (A sketch of a
possible test is at the end of this message.)

I suppose:

> @@ -2286,7 +2289,8 @@
>  			       (VNx8DI "VNx2BI") (VNx8DF "VNx2BI")
>  			       (V8QI "VNx8BI") (V16QI "VNx16BI")
>  			       (V4HI "VNx4BI") (V8HI "VNx8BI") (V2SI "VNx2BI")
> -			       (V4SI "VNx4BI") (V2DI "VNx2BI")])
> +			       (V4SI "VNx4BI") (V2DI "VNx2BI")
> +			       (SF "VNx4BI") (DF "VNx2BI")])

...this again raises the question of what we should do for predicate modes
when the data mode isn't a natural SVE mode.  That came up recently in
relation to V1DI in the popcount patch, and for reductions in the ANDV
etc. patch.

Three obvious options are:

(1) Use the nearest SVE mode with a full ptrue (as the patch does).

(2) Use the nearest SVE mode with a 128-bit ptrue.

(3) Add new modes V16BI, V8BI, V4BI, V2BI, and V1BI.
    (And possibly BI for scalars.)

The problem with (1) is that, as Tamar pointed out, it doesn't work
properly with reductions.  It also isn't safe for this patch (without
fast-mathy options) because of FP exceptions.  Although writing to a
scalar FP register zeros the upper bits, and so gives a "safe" value
for this particular operation, nothing guarantees that all SF and DF
values have this zero-extended form.  They could come from subregs of
Advanced SIMD or SVE vectors; see the illustration at the end of this
message.  The ABI also doesn't guarantee that incoming SF and DF
values are zero-extended.

(2) would be safe, but would mean that we continue to have an nunits
disagreement between the data mode and the predicate mode.  This would
prevent such operations being described in generic RTL in future.

(3) is probably the cleanest representational approach, but has the
problem that we cannot store a fixed-length portion of an SVE predicate.
We would have to load and store the modes via other register classes.
(With PMOV we could use scalar FP loads and stores, but otherwise we'd
probably need secondary memory reloads.)  That said, we could tell the
RA to spill in a full predicate mode, so this shouldn't be a problem
unless the modes somehow get exposed to gimple or frontends.

WDYT?

Richard

> ;; ...and again in lower case.
> (define_mode_attr vpred [(VNx16QI "vnx16bi") (VNx8QI "vnx8bi")

> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/fscale.c b/gcc/testsuite/gcc.target/aarch64/sve/fscale.c
> new file mode 100644
> index 00000000000..251b4ef9188
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/fscale.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-Ofast" } */
> +
> +float
> +test_ldexpf (float x, int i)
> +{
> +  return __builtin_ldexpf (x, i);
> +}
> +/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
> +
> +double
> +test_ldexp (double x, int i)
> +{
> +  return __builtin_ldexp (x, i);
> +}
> +/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
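
For concreteness, a .H-form test in the same style as fscale.c above
might look like the following.  This is only an untested sketch: it
assumes the FSCALE expansion would also be wired up for HFmode, and
that the .h scan pattern would mirror the .s/.d ones.

_Float16
test_ldexpf16 (_Float16 x, int i)
{
  return __builtin_ldexpf16 (x, i);
}
/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */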
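
And to make the subreg point under (1) concrete, here is a minimal
(hypothetical) example in which an SF value is a lane of an Advanced
SIMD vector, so the upper bits of the register that holds it are live
data rather than zeros:

#include <arm_neon.h>

/* 'x' is lane 0 of 'v': at the RTL level it can be a subreg of the
   vector register, whose upper bits hold v[1..3].  A full-width
   FSCALE would then operate on those live lanes as well and could
   raise spurious FP exceptions.  */
float
lane0_ldexpf (float32x4_t v, int i)
{
  float x = vgetq_lane_f32 (v, 0);
  return __builtin_ldexpf (x, i);
}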