https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151
Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-01-20
             Status|UNCONFIRMED                 |NEW
            Summary|x86: excessive code         |[9/10/11/12 Regression]
                   |generated for 128-bit       |x86: excessive code
                   |byteswap                    |generated for 128-bit
                   |                            |byteswap
     Ever confirmed|0                           |1
             Blocks|                            |101926
   Target Milestone|---                         |12.0
      Known to work|                            |6.1.0
          Component|target                      |middle-end
           Keywords|                            |missed-optimization

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
GCC 11 and before produce this at -O2 (GCC 6 and before could do it at -O3 too):

        mov     rax, rsi
        mov     rdx, rdi
        bswap   rax
        bswap   rdx

The reason is that the SLP vectorizer is turned on at -O2 in GCC 12. We get:

        _11 = {_1, _2};
        _5 = VIEW_CONVERT_EXPR<uint128_t>(_11);

The expansion of this could be done using move instructions ....

I notice that for aarch64, SLP kicks in even more and does the following:

        fmov    d0, x0
        fmov    v0.d[1], x1
        ext     v0.16b, v0.16b, v0.16b, #8
        rev64   v0.16b, v0.16b
        umov    x0, v0.d[0]
        umov    x1, v0.d[1]

This is true even for -O2 -mavx:

        mov     QWORD PTR [rsp-24], rdi
        mov     QWORD PTR [rsp-16], rsi
        vmovdqa xmm1, XMMWORD PTR [rsp-24]
        vpalignr xmm0, xmm1, xmm1, 8
        vpshufb xmm2, xmm0, XMMWORD PTR .LC0[rip]
        vmovdqa XMMWORD PTR [rsp-24], xmm2
        mov     rax, QWORD PTR [rsp-24]
        mov     rdx, QWORD PTR [rsp-16]

There seem to be many different little regressions in the handling of this code. But I think it all comes down to the modeling of arguments and return values at the GIMPLE level, which breaks this.

Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101926
[Bug 101926] [meta-bug] struct/complex argument passing and return should be improved