Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Sent: Thursday, October 3, 2024 4:45 PM
>> To: Richard Sandiford <richard.sandif...@arm.com>
>> Cc: Soumya AR <soum...@nvidia.com>; Tamar Christina <tamar.christ...@arm.com>;
>> gcc-patches@gcc.gnu.org; Richard Earnshaw <richard.earns...@arm.com>;
>> Jennifer Schmitz <jschm...@nvidia.com>; Pengxuan Zheng (QUIC) <quic_pzh...@quicinc.com>
>> Subject: Re: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction
>>
>> > On 3 Oct 2024, at 16:41, Richard Sandiford <richard.sandif...@arm.com> wrote:
>> >
>> > Soumya AR <soum...@nvidia.com> writes:
>> >> From 7fafcb5e0174c56205ec05406c9a412196ae93d3 Mon Sep 17 00:00:00 2001
>> >> From: Soumya AR <soum...@nvidia.com>
>> >> Date: Thu, 3 Oct 2024 11:53:07 +0530
>> >> Subject: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction
>> >>
>> >> This patch uses the FSCALE instruction provided by SVE to implement the
>> >> standard ldexp family of functions.
>> >>
>> >> Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the
>> >> following code:
>> >>
>> >> float
>> >> test_ldexpf (float x, int i)
>> >> {
>> >>   return __builtin_ldexpf (x, i);
>> >> }
>> >>
>> >> double
>> >> test_ldexp (double x, int i)
>> >> {
>> >>   return __builtin_ldexp (x, i);
>> >> }
>> >>
>> >> GCC Output:
>> >>
>> >> test_ldexpf:
>> >>   b ldexpf
>> >>
>> >> test_ldexp:
>> >>   b ldexp
>> >>
>> >> Since SVE has support for an FSCALE instruction, we can use this to process
>> >> scalar floats by moving them to a vector register and performing an fscale,
>> >> similar to how LLVM handles the ldexp builtin.
>> >>
>> >> New Output:
>> >>
>> >> test_ldexpf:
>> >>   fmov    s31, w0
>> >>   ptrue   p7.b, all
>> >>   fscale  z0.s, p7/m, z0.s, z31.s
>> >>   ret
>> >>
>> >> test_ldexp:
>> >>   sxtw    x0, w0
>> >>   ptrue   p7.b, all
>> >>   fmov    d31, x0
>> >>   fscale  z0.d, p7/m, z0.d, z31.d
>> >>   ret
>> >>
>> >> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>> >> OK for mainline?
>> >
>> > Could we also use the .H form for __builtin_ldexpf16?
>> >
>> > I suppose:
>> >
>> >> @@ -2286,7 +2289,8 @@
>> >>                        (VNx8DI "VNx2BI") (VNx8DF "VNx2BI")
>> >>                        (V8QI "VNx8BI") (V16QI "VNx16BI")
>> >>                        (V4HI "VNx4BI") (V8HI "VNx8BI") (V2SI "VNx2BI")
>> >> -                      (V4SI "VNx4BI") (V2DI "VNx2BI")])
>> >> +                      (V4SI "VNx4BI") (V2DI "VNx2BI")
>> >> +                      (SF "VNx4BI") (DF "VNx2BI")])
>> >
>> > ...this again raises the question of what we should do for predicate
>> > modes when the data mode isn't a natural SVE mode.  That came up
>> > recently in relation to V1DI in the popcount patch, and for reductions
>> > in the ANDV etc. patch.
>>
>> Thank you for enumerating the options below.
>>
>> > Three obvious options are:
>> >
>> > (1) Use the nearest SVE mode with a full ptrue (as the patch does).
>> > (2) Use the nearest SVE mode with a 128-bit ptrue.
>> > (3) Add new modes V16BI, V8BI, V4BI, V2BI, and V1BI.  (And possibly BI
>> >     for scalars.)
>>
>> Just to be clear, what do you mean by “nearest SVE mode” in this context?
>
> I think he means the smallest SVE mode that has the same unit size as the
> Adv. SIMD register.  I think the idea is that we're consistent with the
> modes used, so we don't end up using e.g. VNx16QI and VNx8QI etc. for b0.
Yeah.  I meant VNx16QI for anything with an inner mode of QI, etc.

>> > The problem with (1) is that, as Tamar pointed out, it doesn't work
>> > properly with reductions.  It also isn't safe for this patch (without
>> > fast-mathy options) because of FP exceptions.  Although writing to
>> > a scalar FP register zeros the upper bits, and so gives a "safe" value
>> > for this particular operation, nothing guarantees that all SF and DF
>> > values have this zero-extended form.  They could come from subregs of
>> > Advanced SIMD or SVE vectors.  The ABI also doesn't guarantee that
>> > incoming SF and DF values are zero-extended.
>> >
>> > (2) would be safe, but would mean that we continue to have an nunits
>> > disagreement between the data mode and the predicate mode.  This would
>> > prevent operations being described in generic RTL in future.
>> >
>> > (3) is probably the cleanest representational approach, but has the problem
>> > that we cannot store a fixed-length portion of an SVE predicate.
>> > We would have to load and store the modes via other register classes.
>> > (With PMOV we could use scalar FP loads and stores, but otherwise
>> > we'd probably need secondary memory reloads.)  That said, we could
>> > tell the RA to spill in a full predicate mode, so this shouldn't be
>> > a problem unless the modes somehow get exposed to gimple or frontends.
>> >
>> > WDYT?
>>
>> IMO option (2) sounds the most appealing at this stage.  To me it feels
>> conceptually straightforward, as we are using an SVE operation clamped at
>> 128 bits to “emulate” what should have been a 128-bit fixed-width mode
>> operation.
>> It also feels that, given the complexity of (3) and of introducing new
>> modes, we should go for (3) only if/when we do decide to implement these
>> ops with generic RTL.
>
> 2 is already the case today, isn't it?  E.g. the 128-bit ptrue can be
> created with intrinsics and is already common with the SVE-NEON intrinsics
> bridge.

(2) isn't what we currently do for Advanced SIMD modes, but yeah, the
infrastructure is already there to support it.

> Which also means that if we do go with 3, the modes do have to be exposed
> to gimple, since otherwise we'll have different modes between the intrinsics
> usage and those the compiler generates?  Or am I misunderstanding a part?

I don't think it would affect the intrinsics.  Intrinsics like svptrue
always produce a full svbool_t, i.e. a full VNx16BI.  Although the result
of a VL16 svptrue call only has a certain number of leading bits set,
that's a property of that particular value, rather than a property of the
type/mode.  The code that follows the svptrue could go on to set later
bits, without that being a change in the type/mode.

With (3), the modes would require that all bits beyond the first 8 or 16
are zero.  Anything that tries to set other bits would be invalid.

But I suppose that's another disadvantage of (3).  It would mean having to
maintain things like V16BI in a canonical form, a bit like
TRULY_NOOP_TRUNCATION for 32-bit values on MIPS.  And that's a wound we
should only self-inflict if we really need to.  So yeah...

> It seems to me that if we ever do describe them in RTL then the intrinsics
> have to change too, so we'd have to rethink the data model at that point
> anyway.
>
> FWIW, I agree that 2 seems the most sensible for now.  Though I do like 3
> if we do want to start optimizing predicate usage in RTL.

...I agree that (2) is the best choice as things stand.

Thanks for the discussion.  Seems like we've reached consensus.

Richard
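
P.S. For the .H form question above, a minimal (untested) test along the
lines of the SF/DF examples in the patch might look like the following,
assuming __builtin_ldexpf16 is available for _Float16 and using
test_ldexpf16 purely as a placeholder name:

_Float16
test_ldexpf16 (_Float16 x, int i)
{
  /* With an HF entry added alongside the SF/DF ones, this would
     hopefully expand to an fscale on .h elements rather than a
     libcall.  */
  return __builtin_ldexpf16 (x, i);
}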