https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151
--- Comment #7 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #6)
> Richard - I'm sure we can construct a similar case for aarch64 where
> argument passing and vector mode use cause spilling?
>
> On x86 the simplest testcase showing this is
>
> typedef unsigned long long v2di __attribute__((vector_size(16)));
> v2di bswap(__uint128_t a)
> {
>   return *(v2di *)&a;
> }
>
> that produces
>
> bswap:
> .LFB0:
>         .cfi_startproc
>         sub     sp, sp, #16
>         .cfi_def_cfa_offset 16
>         stp     x0, x1, [sp]
>         ldr     q0, [sp]
>         add     sp, sp, 16
>         .cfi_def_cfa_offset 0
>         ret
>
> on arm for me.  Maybe the stp x0, x1 store can forward to the ldr load
> though and I'm not sure there's another way to move x0/x1 to q0.

It looks like this is a deliberate choice for aarch64.  The generic
costing has:

  /* Avoid the use of slow int<->fp moves for spilling by setting
     their cost higher than memmov_cost.  */
  5, /* GP2FP */

So in cases like the above, we're telling IRA that spilling to memory
and reloading is cheaper than moving between registers.

For -mtune=thunderx we generate:

        fmov    d0, x0
        ins     v0.d[1], x1
        ret

instead.
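
For context, the GP2FP entry quoted above comes from the generic
register-move cost table in the aarch64 backend
(gcc/config/aarch64/aarch64.cc, aarch64.c in older releases).  From
memory it looks roughly like the sketch below; the exact values can
differ between releases, so treat this as illustrative rather than
authoritative:

  static const struct cpu_regmove_cost generic_regmove_cost =
  {
    1, /* GP2GP */
    /* Avoid the use of slow int<->fp moves for spilling by setting
       their cost higher than memmov_cost.  */
    5, /* GP2FP */
    5, /* FP2GP */
    2  /* FP2FP */
  };

IRA compares these per-move costs against memmov_cost, so raising
GP2FP above the memory-move cost is what steers it towards the
stp/ldr spill sequence in the testcase above.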
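
In source form, the register-only sequence used for -mtune=thunderx
corresponds roughly to the following intrinsics.  This is only a
hypothetical illustration (combine_gprs is a made-up name, not part of
the bug report); on aarch64 it is expected to compile to the fmov/ins
pair rather than the stack round trip:

  #include <stdint.h>
  #include <arm_neon.h>

  /* Build a 128-bit vector value from two general-purpose registers.
     Expected codegen on aarch64:
         fmov    d0, x0
         ins     v0.d[1], x1
     i.e. the direct GP->FP move path rather than stp/ldr via the stack.  */
  uint64x2_t combine_gprs(uint64_t lo, uint64_t hi)
  {
    return vcombine_u64(vcreate_u64(lo), vcreate_u64(hi));
  }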