Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Sent: Thursday, October 3, 2024 4:45 PM
>> To: Richard Sandiford <richard.sandif...@arm.com>
>> Cc: Soumya AR <soum...@nvidia.com>; Tamar Christina
>> <tamar.christ...@arm.com>; gcc-patches@gcc.gnu.org; Richard Earnshaw
>> <richard.earns...@arm.com>; Jennifer Schmitz <jschm...@nvidia.com>;
>> Pengxuan Zheng (QUIC) <quic_pzh...@quicinc.com>
>> Subject: Re: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE 
>> instruction
>>
>>
>>
>> > On 3 Oct 2024, at 16:41, Richard Sandiford <richard.sandif...@arm.com>
>> wrote:
>> >
>> > External email: Use caution opening links or attachments
>> >
>> >
>> > Soumya AR <soum...@nvidia.com> writes:
>> >> From 7fafcb5e0174c56205ec05406c9a412196ae93d3 Mon Sep 17 00:00:00
>> 2001
>> >> From: Soumya AR <soum...@nvidia.com>
>> >> Date: Thu, 3 Oct 2024 11:53:07 +0530
>> >> Subject: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE 
>> >> instruction
>> >>
>> >> This patch uses the FSCALE instruction provided by SVE to implement the
>> >> standard ldexp family of functions.
>> >>
>> >> Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the
>> >> following code:
>> >>
>> >> float
>> >> test_ldexpf (float x, int i)
>> >> {
>> >>      return __builtin_ldexpf (x, i);
>> >> }
>> >>
>> >> double
>> >> test_ldexp (double x, int i)
>> >> {
>> >>      return __builtin_ldexp(x, i);
>> >> }
>> >>
>> >> GCC Output:
>> >>
>> >> test_ldexpf:
>> >>      b ldexpf
>> >>
>> >> test_ldexp:
>> >>      b ldexp
>> >>
>> >> Since SVE provides an FSCALE instruction, we can use it to process
>> >> scalar floats by moving them to a vector register and performing an
>> >> fscale call, similar to how LLVM handles the ldexp builtin.
>> >>
>> >> New Output:
>> >>
>> >> test_ldexpf:
>> >>      fmov s31, w0
>> >>      ptrue p7.b, all
>> >>      fscale z0.s, p7/m, z0.s, z31.s
>> >>      ret
>> >>
>> >> test_ldexp:
>> >>      sxtw x0, w0
>> >>      ptrue p7.b, all
>> >>      fmov d31, x0
>> >>      fscale z0.d, p7/m, z0.d, z31.d
>> >>      ret
>> >>
>> >> The patch was bootstrapped and regtested on aarch64-linux-gnu, no
>> regression.
>> >> OK for mainline?
>> >
>> > Could we also use the .H form for __builtin_ldexpf16?
>> >
>> > I suppose:
>> >
>> >> @@ -2286,7 +2289,8 @@
>> >>                       (VNx8DI "VNx2BI") (VNx8DF "VNx2BI")
>> >>                       (V8QI "VNx8BI") (V16QI "VNx16BI")
>> >>                       (V4HI "VNx4BI") (V8HI "VNx8BI") (V2SI "VNx2BI")
>> >> -                      (V4SI "VNx4BI") (V2DI "VNx2BI")])
>> >> +                      (V4SI "VNx4BI") (V2DI "VNx2BI")
>> >> +                      (SF "VNx4BI") (DF "VNx2BI")])
>> >
>> > ...this again raises the question what we should do for predicate
>> > modes when the data mode isn't a natural SVE mode.  That came up
>> > recently in relation to V1DI in the popcount patch, and for reductions
>> > in the ANDV etc. patch.
>>
>> Thank you for enumerating the options below.
>>
>> >
>> > Three obvious options are:
>> >
>> > (1) Use the nearest SVE mode with a full ptrue (as the patch does).
>> > (2) Use the nearest SVE mode with a 128-bit ptrue.
>> > (3) Add new modes V16BI, V8BI, V4BI, V2BI, and V1BI.  (And possibly BI
>> >    for scalars.)
>>
>> Just to be clear, what do you mean by “nearest SVE mode” in this context?
>>
>
> I think he means the smallest SVE mode that has the same unit size as the
> Adv. SIMD register.
> I think the idea is that we stay consistent in the modes used, so we don't
> end up using e.g. VNx16QI and VNx8QI etc. for e.g. b0.

Yeah.  I meant VNx16QI for anything with an inner mode of QI, etc.

>> > The problem with (1) is that, as Tamar pointed out, it doesn't work
>> > properly with reductions.  It also isn't safe for this patch (without
>> > fast-mathy options) because of FP exceptions.  Although writing to
>> > a scalar FP register zeros the upper bits, and so gives a "safe" value
>> > for this particular operation, nothing guarantees that all SF and DF
>> > values have this zero-extended form.  They could come from subregs of
>> > Advanced SIMD or SVE vectors.  The ABI also doesn't guarantee that
>> > incoming SF and DF values are zero-extended.
>> >
>> > (2) would be safe, but would mean that we continue to have an nunits
>> > disagreement between the data mode and the predicate mode.  This would
>> > prevent operations being described in generic RTL in future.
>> >
>> > (3) is probably the cleanest representational approach, but has the problem
>> > that we cannot store a fixed-length portion of an SVE predicate.
>> > We would have to load and store the modes via other register classes.
>> > (With PMOV we could use scalar FP loads and stores, but otherwise
>> > we'd probably need secondary memory reloads.)  That said, we could
>> > tell the RA to spill in a full predicate mode, so this shouldn't be
>> > a problem unless the modes somehow get exposed to gimple or frontends.
>> >
>> > WDYT?
>>
>> IMO option (2) sounds the most appealing at this stage. To me it feels
>> conceptually straightforward, as we are using an SVE operation clamped at
>> 128 bits to “emulate” what should have been a 128-bit fixed-width mode
>> operation.
>> It also feels that, given the complexity of (3) and introducing new modes,
>> we should go for (3) only if/when we do decide to implement these ops with
>> generic RTL.
>
> 2 is already the case today, isn't it? e.g. the 128-bit ptrue can be
> created with intrinsics and is already common with the SVE/Neon intrinsics
> bridge.

(2) isn't what we currently do for Advanced SIMD modes, but yeah,
the infrastructure is already there to support it.

> Which also means that if we do go with 3, the modes do have to be exposed
> to gimple, since otherwise we'll have different modes between the intrinsics
> usage and those the compiler generates?  Or am I misunderstanding a part?

I don't think it would affect the intrinsics.  Intrinsics like svptrue
always produce a full svbool_t, i.e. full VNx16BI.  Although the result
of a VL16 svptrue call only has a certain number of leading bits set,
that's a property of that particular value, rather than a property of
the type/mode.  The code that follows the svptrue could go on to set
later bits, without that being a change in the type/mode.

With (3), the modes would require that all bits beyond the first 8 or 16
are zero.  Anything that tries to set other bits would be invalid.

But I suppose that's another disadvantage of (3).  It would mean
having to maintain things like V16BI in a canonical form, a bit like
TRULY_NOOP_TRUNCATION for 32-bit values on MIPS.  And that's a wound
we should only self-inflict if we really need to.

So yeah...

> It seems to me that if we ever do describe them in RTL, the intrinsics
> would have to change too, so we'd have to rethink the data model at that
> point anyway.
>
> FWIW, I agree that 2 seems the most sensible for now. Though I do like 3 if
> we do want to start optimizing predicate usage in RTL.

...I agree that (2) is the best choice as things stand.

Thanks for the discussion.  Seems like we've reached consensus.

Richard
