> -----Original Message----- > From: Kyrylo Tkachov > Sent: Thursday, May 18, 2023 12:14 PM > To: gcc-patches@gcc.gnu.org > Cc: Richard Sandiford <richard.sandif...@arm.com> > Subject: [PATCH] aarch64: Implement vector FP absolute compare intrinsics > with builtins > > Hi all, > > While optimising some vector math library code with intrinsics we stumbled > upon the issue in the testcase. > The compiler should be generating a FACGT instruction but instead we > generate: > foo(__Float32x4_t, __Float32x4_t, __Float32x4_t): > fabs v0.4s, v0.4s > adrp x0, .LC0 > ldr q31, [x0, #:lo12:.LC0] > fcmgt v0.4s, v0.4s, v31.4s > ret > > This is because the vcagtq_f32 intrinsic is open-coded in arm_neon.h as > return vabsq_f32 (__a) > vabsq_f32 (__b) > thus relying on the optimisers to merge it back together. But since one of the > arms of the comparison > is a vector constant the combine pass optimises the abs into it and tries > matching: > (set (reg:V4SI 101) > (neg:V4SI (gt:V4SI (reg:V4SF 100) > (const_vector:V4SF [ > (const_double:SF 1.0e+2 [0x0.c8p+7]) repeated x4 > ])))) > and > (set (reg:V4SI 101) > (neg:V4SI (gt:V4SI (abs:V4SF (reg:V4SF 104)) > (reg:V4SF 103)))) > > instead of what we want: > (insn 13 9 14 2 (set (reg/i:V4SI 32 v0) > (neg:V4SI (gt:V4SI (abs:V4SF (reg:V4SF 98)) > (abs:V4SF (reg:V4SF 96))))) > > I don't really see a good way around that with our current implementation of > these intrinsics. > Therefore this patch reimplements these intrinsics with aarch64 builtins that > generate the RTL for these > instructions directly. Apparently we already had them defined in aarch64- > simd-builtins.def and have been > using them for the fp16 case already. > I realise that this approach is against the general principle of expressing > intrinsics in the higher-level constructs, > so I'm willing to listen to counter-arguments. > That said, the FACGT/FACGE instructions are as fast as the non-ABS > comparison instructions on all microarchitectures that I know of > so it should always be a win to have them in the merged form rather than > split the fabs step separately or try to hoist it. > And the testcase does come from real library code that we're trying to > optimise. > With this patch for the testcase we generate: > foo: > adrp x0, .LC0 > ldr q31, [x0, #:lo12:.LC0] > facgt v0.4s, v0.4s, v31.4s > ret > > Bootstrapped and tested on aarch64-none-linux-gnu. > I'll hold off on committing this to give folks a few days to comment, but will > push by the end of next week if there are no objections. Pushed to trunk. Thanks, Kyrill > > Thanks, > Kyrill > > gcc/ChangeLog: > > * config/aarch64/arm_neon.h (vcage_f64): Reimplement with > builtins. > (vcage_f32): Likewise. > (vcages_f32): Likewise. > (vcageq_f32): Likewise. > (vcaged_f64): Likewise. > (vcageq_f64): Likewise. > (vcagts_f32): Likewise. > (vcagt_f32): Likewise. > (vcagt_f64): Likewise. > (vcagtq_f32): Likewise. > (vcagtd_f64): Likewise. > (vcagtq_f64): Likewise. > (vcale_f32): Likewise. > (vcale_f64): Likewise. > (vcaled_f64): Likewise. > (vcales_f32): Likewise. > (vcaleq_f32): Likewise. > (vcaleq_f64): Likewise. > (vcalt_f32): Likewise. > (vcalt_f64): Likewise. > (vcaltd_f64): Likewise. > (vcaltq_f32): Likewise. > (vcaltq_f64): Likewise. > (vcalts_f32): Likewise. > > gcc/testsuite/ChangeLog: > > * gcc.target/aarch64/simd/facgt_constpool_1.c: New test.
RE: [PATCH] aarch64: Implement vector FP absolute compare intrinsics with builtins
Kyrylo Tkachov via Gcc-patches Thu, 25 May 2023 01:50:22 -0700
- [PATCH] aarch64: Implement vector FP absolu... Kyrylo Tkachov via Gcc-patches
- RE: [PATCH] aarch64: Implement vector ... Kyrylo Tkachov via Gcc-patches