https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151
Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-01-20
             Status|UNCONFIRMED                 |NEW
            Summary|x86: excessive code         |[9/10/11/12 Regression]
                   |generated for 128-bit       |x86: excessive code
                   |byteswap                    |generated for 128-bit
                   |                            |byteswap
     Ever confirmed|0                           |1
             Blocks|                            |101926
   Target Milestone|---                         |12.0
      Known to work|                            |6.1.0
          Component|target                      |middle-end
           Keywords|                            |missed-optimization

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
GCC 11 and before produce this at -O2 (GCC 6 and before could do it at -O3 too):

        mov     rax, rsi
        mov     rdx, rdi
        bswap   rax
        bswap   rdx

The reason is that the SLP vectorizer is turned on at -O2 in GCC 12. We get:

        _11 = {_1, _2};
        _5 = VIEW_CONVERT_EXPR<uint128_t>(_11);

The expansion of this could be done using move instructions ....

I notice that for aarch64, SLP kicks in even more and does the following:

        fmov    d0, x0
        fmov    v0.d[1], x1
        ext     v0.16b, v0.16b, v0.16b, #8
        rev64   v0.16b, v0.16b
        umov    x0, v0.d[0]
        umov    x1, v0.d[1]

This is true even for -O2 -mavx:

        mov     QWORD PTR [rsp-24], rdi
        mov     QWORD PTR [rsp-16], rsi
        vmovdqa xmm1, XMMWORD PTR [rsp-24]
        vpalignr xmm0, xmm1, xmm1, 8
        vpshufb xmm2, xmm0, XMMWORD PTR .LC0[rip]
        vmovdqa XMMWORD PTR [rsp-24], xmm2
        mov     rax, QWORD PTR [rsp-24]
        mov     rdx, QWORD PTR [rsp-16]

There seem to be many different little regressions in the handling of this code. But I think it all comes down to the modeling of arguments and return values at the GIMPLE level, which breaks this.

Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101926
[Bug 101926] [meta-bug] struct/complex argument passing and return should be improved