> On 3 Oct 2024, at 16:41, Richard Sandiford <[email protected]> wrote:
>
>
> Soumya AR <[email protected]> writes:
>> From 7fafcb5e0174c56205ec05406c9a412196ae93d3 Mon Sep 17 00:00:00 2001
>> From: Soumya AR <[email protected]>
>> Date: Thu, 3 Oct 2024 11:53:07 +0530
>> Subject: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction
>>
>> This patch uses the FSCALE instruction provided by SVE to implement the
>> standard ldexp family of functions.
>>
>> Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the
>> following code:
>>
>> float
>> test_ldexpf (float x, int i)
>> {
>> return __builtin_ldexpf (x, i);
>> }
>>
>> double
>> test_ldexp (double x, int i)
>> {
>> return __builtin_ldexp(x, i);
>> }
>>
>> GCC Output:
>>
>> test_ldexpf:
>> b ldexpf
>>
>> test_ldexp:
>> b ldexp
>>
>> Since SVE provides an FSCALE instruction, we can use it to process
>> scalar floats by moving them to a vector register and performing an
>> FSCALE, similar to how LLVM handles the ldexp builtins.
>>
>> New Output:
>>
>> test_ldexpf:
>> fmov s31, w0
>> ptrue p7.b, all
>> fscale z0.s, p7/m, z0.s, z31.s
>> ret
>>
>> test_ldexp:
>> sxtw x0, w0
>> ptrue p7.b, all
>> fmov d31, x0
>> fscale z0.d, p7/m, z0.d, z31.d
>> ret
>>
>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>> OK for mainline?
>
> Could we also use the .H form for __builtin_ldexpf16?
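For reference, the analogous testsuite entry would presumably look something like the sketch below (hypothetical, mirroring the tests in the patch's fscale.c; untested):

```c
/* Hypothetical addition to gcc/testsuite/gcc.target/aarch64/sve/fscale.c,
   mirroring the existing float/double tests; untested sketch.  */
_Float16
test_ldexpf16 (_Float16 x, int i)
{
  return __builtin_ldexpf16 (x, i);
}
/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
```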
>
> I suppose:
>
>> @@ -2286,7 +2289,8 @@
>> (VNx8DI "VNx2BI") (VNx8DF "VNx2BI")
>> (V8QI "VNx8BI") (V16QI "VNx16BI")
>> (V4HI "VNx4BI") (V8HI "VNx8BI") (V2SI "VNx2BI")
>> - (V4SI "VNx4BI") (V2DI "VNx2BI")])
>> + (V4SI "VNx4BI") (V2DI "VNx2BI")
>> + (SF "VNx4BI") (DF "VNx2BI")])
>
> ...this again raises the question what we should do for predicate
> modes when the data mode isn't a natural SVE mode. That came up
> recently in relation to V1DI in the popcount patch, and for reductions
> in the ANDV etc. patch.
Thank you for enumerating the options below.
>
> Three obvious options are:
>
> (1) Use the nearest SVE mode with a full ptrue (as the patch does).
> (2) Use the nearest SVE mode with a 128-bit ptrue.
> (3) Add new modes V16BI, V8BI, V4BI, V2BI, and V1BI. (And possibly BI
> for scalars.)
Just to be clear, what do you mean by “nearest SVE mode” in this context?
>
> The problem with (1) is that, as Tamar pointed out, it doesn't work
> properly with reductions. It also isn't safe for this patch (without
> fast-mathy options) because of FP exceptions. Although writing to
> a scalar FP register zeros the upper bits, and so gives a "safe" value
> for this particular operation, nothing guarantees that all SF and DF
> values have this zero-extended form. They could come from subregs of
> Advanced SIMD or SVE vectors. The ABI also doesn't guarantee that
> incoming SF and DF values are zero-extended.
>
> (2) would be safe, but would mean that we continue to have an nunits
> disagreement between the data mode and the predicate mode. This would
> prevent operations being described in generic RTL in future.
>
> (3) is probably the cleanest representational approach, but has the problem
> that we cannot store a fixed-length portion of an SVE predicate.
> We would have to load and store the modes via other register classes.
> (With PMOV we could use scalar FP loads and stores, but otherwise
> we'd probably need secondary memory reloads.) That said, we could
> tell the RA to spill in a full predicate mode, so this shouldn't be
> a problem unless the modes somehow get exposed to gimple or frontends.
>
> WDYT?
IMO option (2) sounds the most appealing at this stage. To me it feels
conceptually straightforward, as we are using an SVE operation clamped at
128 bits to “emulate” what should have been a 128-bit fixed-width mode
operation.
It also feels that, given the complexity of (3) and the new modes it would
introduce, we should go for (3) only if/when we decide to implement these
ops in generic RTL.
Thanks,
Kyrill
>
> Richard
>
>> ;; ...and again in lower case.
>> (define_mode_attr vpred [(VNx16QI "vnx16bi") (VNx8QI "vnx8bi")
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/fscale.c b/gcc/testsuite/gcc.target/aarch64/sve/fscale.c
>> new file mode 100644
>> index 00000000000..251b4ef9188
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/fscale.c
>> @@ -0,0 +1,16 @@
>> +/* { dg-do compile } */
>> +/* { dg-additional-options "-Ofast" } */
>> +
>> +float
>> +test_ldexpf (float x, int i)
>> +{
>> + return __builtin_ldexpf (x, i);
>> +}
>> +/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
>> +
>> +double
>> +test_ldexp (double x, int i)
>> +{
>> + return __builtin_ldexp (x, i);
>> +}
>> +/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */