Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Sent: Thursday, October 3, 2024 4:45 PM
>> To: Richard Sandiford <richard.sandif...@arm.com>
>> Cc: Soumya AR <soum...@nvidia.com>; Tamar Christina <tamar.christ...@arm.com>;
>> gcc-patches@gcc.gnu.org; Richard Earnshaw <richard.earns...@arm.com>;
>> Jennifer Schmitz <jschm...@nvidia.com>; Pengxuan Zheng (QUIC) <quic_pzh...@quicinc.com>
>> Subject: Re: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction
>>
>> > On 3 Oct 2024, at 16:41, Richard Sandiford <richard.sandif...@arm.com> wrote:
>> >
>> > Soumya AR <soum...@nvidia.com> writes:
>> >> From 7fafcb5e0174c56205ec05406c9a412196ae93d3 Mon Sep 17 00:00:00 2001
>> >> From: Soumya AR <soum...@nvidia.com>
>> >> Date: Thu, 3 Oct 2024 11:53:07 +0530
>> >> Subject: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction
>> >>
>> >> This patch uses the FSCALE instruction provided by SVE to implement the
>> >> standard ldexp family of functions.
>> >>
>> >> Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the
>> >> following code:
>> >>
>> >> float
>> >> test_ldexpf (float x, int i)
>> >> {
>> >>   return __builtin_ldexpf (x, i);
>> >> }
>> >>
>> >> double
>> >> test_ldexp (double x, int i)
>> >> {
>> >>   return __builtin_ldexp (x, i);
>> >> }
>> >>
>> >> GCC Output:
>> >>
>> >> test_ldexpf:
>> >>   b ldexpf
>> >>
>> >> test_ldexp:
>> >>   b ldexp
>> >>
>> >> Since SVE has support for an FSCALE instruction, we can use this to process
>> >> scalar floats by moving them to a vector register and performing an fscale,
>> >> similar to how LLVM handles the ldexp builtin.
>> >>
>> >> New Output:
>> >>
>> >> test_ldexpf:
>> >>   fmov    s31, w0
>> >>   ptrue   p7.b, all
>> >>   fscale  z0.s, p7/m, z0.s, z31.s
>> >>   ret
>> >>
>> >> test_ldexp:
>> >>   sxtw    x0, w0
>> >>   ptrue   p7.b, all
>> >>   fmov    d31, x0
>> >>   fscale  z0.d, p7/m, z0.d, z31.d
>> >>   ret
>> >>
>> >> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>> >> OK for mainline?
>> >
>> > Could we also use the .H form for __builtin_ldexpf16?
>> >
>> > I suppose:
>> >
>> >> @@ -2286,7 +2289,8 @@
>> >>                        (VNx8DI "VNx2BI") (VNx8DF "VNx2BI")
>> >>                        (V8QI "VNx8BI") (V16QI "VNx16BI")
>> >>                        (V4HI "VNx4BI") (V8HI "VNx8BI") (V2SI "VNx2BI")
>> >> -                      (V4SI "VNx4BI") (V2DI "VNx2BI")])
>> >> +                      (V4SI "VNx4BI") (V2DI "VNx2BI")
>> >> +                      (SF "VNx4BI") (DF "VNx2BI")])
>> >
>> > ...this again raises the question of what we should do for predicate
>> > modes when the data mode isn't a natural SVE mode.  That came up
>> > recently in relation to V1DI in the popcount patch, and for reductions
>> > in the ANDV etc. patch.
>>
>> Thank you for enumerating the options below.
>>
>> > Three obvious options are:
>> >
>> > (1) Use the nearest SVE mode with a full ptrue (as the patch does).
>> > (2) Use the nearest SVE mode with a 128-bit ptrue.
>> > (3) Add new modes V16BI, V8BI, V4BI, V2BI, and V1BI.  (And possibly BI
>> >     for scalars.)
>>
>> Just to be clear, what do you mean by “nearest SVE mode” in this context?
>
> I think he means the smallest SVE mode that has the same unit size as the
> Adv. SIMD register.  I think the idea is that we're consistent with the
> modes used, so we don't end up using e.g. VNx16QI and VNx8QI etc. for b0.
Yeah.  I meant VNx16QI for anything with an inner mode of QI, etc.

>> > The problem with (1) is that, as Tamar pointed out, it doesn't work
>> > properly with reductions.  It also isn't safe for this patch (without
>> > fast-mathy options) because of FP exceptions.  Although writing to
>> > a scalar FP register zeros the upper bits, and so gives a "safe" value
>> > for this particular operation, nothing guarantees that all SF and DF
>> > values have this zero-extended form.  They could come from subregs of
>> > Advanced SIMD or SVE vectors.  The ABI also doesn't guarantee that
>> > incoming SF and DF values are zero-extended.
>> >
>> > (2) would be safe, but would mean that we continue to have an nunits
>> > disagreement between the data mode and the predicate mode.  This would
>> > prevent operations being described in generic RTL in future.
>> >
>> > (3) is probably the cleanest representational approach, but has the problem
>> > that we cannot store a fixed-length portion of an SVE predicate.
>> > We would have to load and store the modes via other register classes.
>> > (With PMOV we could use scalar FP loads and stores, but otherwise
>> > we'd probably need secondary memory reloads.)  That said, we could
>> > tell the RA to spill in a full predicate mode, so this shouldn't be
>> > a problem unless the modes somehow get exposed to gimple or frontends.
>> >
>> > WDYT?
>>
>> IMO option (2) sounds the most appealing at this stage.  To me it feels
>> conceptually straightforward, as we are using an SVE operation clamped at
>> 128 bits to “emulate” what should have been a 128-bit fixed-width mode
>> operation.
>> It also feels that, given the complexity of (3) and of introducing new
>> modes, we should go for (3) only if/when we do decide to implement these
>> ops with generic RTL.
>
> 2 is already the case today, isn't it?  E.g. the 128-bit ptrue can be
> created with intrinsics and is already common with the SVE-NEON intrinsics
> bridge.

(2) isn't what we currently do for Advanced SIMD modes, but yeah, the
infrastructure is already there to support it.

> Which also means that if we do go with 3, the modes do have to be exposed
> to gimple, since otherwise we'll have different modes between the intrinsics
> usage and those the compiler generates?  Or am I misunderstanding a part?

I don't think it would affect the intrinsics.  Intrinsics like svptrue
always produce a full svbool_t, i.e. a full VNx16BI.  Although the result
of a VL16 svptrue call only has a certain number of leading bits set,
that's a property of that particular value, rather than a property of the
type/mode.  The code that follows the svptrue could go on to set later
bits, without that being a change in the type/mode.

With (3), the modes would require that all bits beyond the first 8 or 16
are zero.  Anything that tries to set other bits would be invalid.

But I suppose that's another disadvantage of (3).  It would mean having to
maintain things like V16BI in a canonical form, a bit like
TRULY_NOOP_TRUNCATION for 32-bit values on MIPS.  And that's a wound we
should only self-inflict if we really need to.  So yeah...

> It seems to me that if we ever do describe them in RTL then the intrinsics
> have to change too, so we'd have to rethink the data model at that point
> anyway.
>
> FWIW, I agree that 2 seems the most sensible for now.  Though I do like 3
> if we do want to start optimizing predicate usage in RTL.

...I agree that (2) is the best choice as things stand.

Thanks for the discussion.  Seems like we've reached consensus.

Richard
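
P.S. For the .H form question above, a minimal (untested) test along the
lines of the SF/DF examples in the patch might look like the following,
assuming __builtin_ldexpf16 is available for _Float16 and using
test_ldexpf16 purely as a placeholder name:

_Float16
test_ldexpf16 (_Float16 x, int i)
{
  /* With an HF entry added alongside the SF/DF ones, this would
     hopefully expand to an fscale on .h elements rather than a
     libcall.  */
  return __builtin_ldexpf16 (x, i);
}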