> -----Original Message-----
> From: Kyrylo Tkachov
> Sent: Thursday, May 18, 2023 12:14 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Sandiford <richard.sandif...@arm.com>
> Subject: [PATCH] aarch64: Implement vector FP absolute compare intrinsics
> with builtins
> 
> Hi all,
> 
> While optimising some vector math library code with intrinsics we stumbled
> upon the issue in the testcase.
> The compiler should be generating a FACGT instruction but instead we
> generate:
> foo(__Float32x4_t, __Float32x4_t, __Float32x4_t):
>         fabs    v0.4s, v0.4s
>         adrp    x0, .LC0
>         ldr     q31, [x0, #:lo12:.LC0]
>         fcmgt   v0.4s, v0.4s, v31.4s
>         ret
> 
> This is because the vcagtq_f32 intrinsic is open-coded in arm_neon.h as
> return vabsq_f32 (__a) > vabsq_f32 (__b)
> thus relying on the optimisers to merge it back together. But since one of the
> arms of the comparison
> is a vector constant the combine pass optimises the abs into it and tries
> matching:
> (set (reg:V4SI 101)
>     (neg:V4SI (gt:V4SI (reg:V4SF 100)
>             (const_vector:V4SF [
>                     (const_double:SF 1.0e+2 [0x0.c8p+7]) repeated x4
>                 ]))))
> and
> (set (reg:V4SI 101)
>     (neg:V4SI (gt:V4SI (abs:V4SF (reg:V4SF 104))
>             (reg:V4SF 103))))
> 
> instead of what we want:
> (insn 13 9 14 2 (set (reg/i:V4SI 32 v0)
>         (neg:V4SI (gt:V4SI (abs:V4SF (reg:V4SF 98))
>                 (abs:V4SF (reg:V4SF 96)))))
> 
> I don't really see a good way around that with our current implementation of
> these intrinsics.
> Therefore this patch reimplements these intrinsics with aarch64 builtins that
> generate the RTL for these
> instructions directly. Apparently we already had them defined in aarch64-
> simd-builtins.def and have been
> using them for the fp16 case already.
> I realise that this approach is against the general principle of expressing
> intrinsics in the higher-level constructs,
> so I'm willing to listen to counter-arguments.
> That said, the FACGT/FACGE instructions are as fast as the non-ABS
> comparison instructions on all microarchitectures that I know of
> so it should always be a win to have them in the merged form rather than
> split the fabs step separately or try to hoist it.
> And the testcase does come from real library code that we're trying to
> optimise.
> With this patch for the testcase we generate:
> foo:
>         adrp    x0, .LC0
>         ldr     q31, [x0, #:lo12:.LC0]
>         facgt   v0.4s, v0.4s, v31.4s
>         ret
> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> I'll hold off on committing this to give folks a few days to comment, but will
> push by the end of next week if there are no objections.

Pushed to trunk.
Thanks,
Kyrill

> 
> Thanks,
> Kyrill
> 
> gcc/ChangeLog:
> 
>       * config/aarch64/arm_neon.h (vcage_f64): Reimplement with
> builtins.
>       (vcage_f32): Likewise.
>       (vcages_f32): Likewise.
>       (vcageq_f32): Likewise.
>       (vcaged_f64): Likewise.
>       (vcageq_f64): Likewise.
>       (vcagts_f32): Likewise.
>       (vcagt_f32): Likewise.
>       (vcagt_f64): Likewise.
>       (vcagtq_f32): Likewise.
>       (vcagtd_f64): Likewise.
>       (vcagtq_f64): Likewise.
>       (vcale_f32): Likewise.
>       (vcale_f64): Likewise.
>       (vcaled_f64): Likewise.
>       (vcales_f32): Likewise.
>       (vcaleq_f32): Likewise.
>       (vcaleq_f64): Likewise.
>       (vcalt_f32): Likewise.
>       (vcalt_f64): Likewise.
>       (vcaltd_f64): Likewise.
>       (vcaltq_f32): Likewise.
>       (vcaltq_f64): Likewise.
>       (vcalts_f32): Likewise.
> 
> gcc/testsuite/ChangeLog:
> 
>       * gcc.target/aarch64/simd/facgt_constpool_1.c: New test.

Reply via email to