https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105354
Bug ID: 105354
Summary: __builtin_shuffle for alignr generates suboptimal code unless SSSE3 is enabled
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: john_platts at hotmail dot com
Target Milestone: ---

The code below is compiled to suboptimal output when SSE2 is enabled but SSSE3 is not:

#include <cstdint>

typedef std::uint8_t Simd128U8VectT __attribute__((__vector_size__(16)));

template<int RotateAmt>
static inline Simd128U8VectT RotateRightByByteAmt(Simd128U8VectT vect) noexcept {
    constexpr int NormalizedRotateAmt = RotateAmt & 15;
    if constexpr (NormalizedRotateAmt == 0)
        return vect;
    else
        return __builtin_shuffle(vect, vect, (Simd128U8VectT){
            NormalizedRotateAmt,      NormalizedRotateAmt + 1,
            NormalizedRotateAmt + 2,  NormalizedRotateAmt + 3,
            NormalizedRotateAmt + 4,  NormalizedRotateAmt + 5,
            NormalizedRotateAmt + 6,  NormalizedRotateAmt + 7,
            NormalizedRotateAmt + 8,  NormalizedRotateAmt + 9,
            NormalizedRotateAmt + 10, NormalizedRotateAmt + 11,
            NormalizedRotateAmt + 12, NormalizedRotateAmt + 13,
            NormalizedRotateAmt + 14, NormalizedRotateAmt + 15 });
}

auto func1(Simd128U8VectT vect) noexcept {
    return RotateRightByByteAmt<5>(vect);
}

Here is the code that GCC 11 generates with -O2 -mssse3:

func1(unsigned char __vector(16)):
        palignr xmm0, xmm0, 5
        ret

Here is the code that GCC 11 generates on 64-bit x86 with -O2 but without -mssse3:

func1(unsigned char __vector(16)):
        sub     rsp, 144
        movd    ecx, xmm0
        movaps  XMMWORD PTR [rsp+8], xmm0
        movzx   edx, BYTE PTR [rsp+20]
        movzx   ecx, cl
        movaps  XMMWORD PTR [rsp+24], xmm0
        movzx   eax, BYTE PTR [rsp+35]
        sal     rdx, 8
        movaps  XMMWORD PTR [rsp+40], xmm0
        or      rdx, rax
        movzx   eax, BYTE PTR [rsp+50]
        movaps  XMMWORD PTR [rsp+56], xmm0
        sal     rdx, 8
        movaps  XMMWORD PTR [rsp+72], xmm0
        or      rdx, rax
        movzx   eax, BYTE PTR [rsp+65]
        movaps  XMMWORD PTR [rsp+88], xmm0
        sal     rdx, 8
        movaps  XMMWORD PTR [rsp+104], xmm0
        or      rdx, rax
        movzx   eax, BYTE PTR [rsp+80]
        movaps  XMMWORD PTR [rsp-104], xmm0
        sal     rdx, 8
        movaps  XMMWORD PTR [rsp-88], xmm0
        movzx   edi, BYTE PTR [rsp-85]
        or      rdx, rax
        movzx   eax, BYTE PTR [rsp+95]
        movaps  XMMWORD PTR [rsp-72], xmm0
        sal     rdx, 8
        movaps  XMMWORD PTR [rsp-56], xmm0
        or      rdx, rax
        movzx   eax, BYTE PTR [rsp+110]
        movaps  XMMWORD PTR [rsp-40], xmm0
        sal     rdx, 8
        movaps  XMMWORD PTR [rsp-24], xmm0
        or      rdx, rax
        movzx   eax, BYTE PTR [rsp-100]
        movaps  XMMWORD PTR [rsp+120], xmm0
        movzx   esi, BYTE PTR [rsp+125]
        movaps  XMMWORD PTR [rsp-8], xmm0
        sal     rdx, 8
        sal     rax, 8
        or      rdx, rsi
        or      rax, rdi
        movzx   edi, BYTE PTR [rsp-70]
        sal     rax, 8
        or      rax, rdi
        movzx   edi, BYTE PTR [rsp-55]
        sal     rax, 8
        or      rax, rdi
        sal     rax, 8
        or      rax, rcx
        movzx   ecx, BYTE PTR [rsp-25]
        sal     rax, 8
        or      rax, rcx
        movzx   ecx, BYTE PTR [rsp-10]
        sal     rax, 8
        or      rax, rcx
        movzx   ecx, BYTE PTR [rsp+5]
        mov     QWORD PTR [rsp-120], rdx
        sal     rax, 8
        or      rax, rcx
        mov     QWORD PTR [rsp-112], rax
        movdqa  xmm0, XMMWORD PTR [rsp-120]
        add     rsp, 144
        ret

Here is a more optimal sequence for func1 on 64-bit x86 when SSE2 is enabled but SSSE3 is not:

func1(unsigned char __vector(16)):
        movdqa  xmm1, xmm0
        psrldq  xmm1, 5
        pslldq  xmm0, 11
        por     xmm0, xmm1
        ret
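For comparison, a minimal SSE2-intrinsics sketch of the same byte rotate (RotateRightByByteAmtSSE2 and func2 are hypothetical names, not part of the testcase) expresses the rotate as the two byte shifts plus OR shown in the sequence above, so GCC should be able to emit psrldq/pslldq/por for it even without SSSE3:

#include <emmintrin.h>

// Hypothetical SSE2-only fallback: rotate the 16 bytes of vect right by
// RotateAmt positions as (vect >> N bytes) | (vect << (16 - N) bytes),
// assuming the usual little-endian element order of the vector.
template<int RotateAmt>
static inline __m128i RotateRightByByteAmtSSE2(__m128i vect) noexcept {
    constexpr int N = RotateAmt & 15;
    if constexpr (N == 0)
        return vect;
    else
        return _mm_or_si128(_mm_srli_si128(vect, N),        // psrldq by N bytes
                            _mm_slli_si128(vect, 16 - N));   // pslldq by 16 - N bytes
}

auto func2(__m128i vect) noexcept {
    return RotateRightByByteAmtSSE2<5>(vect);  // expected: psrldq 5 / pslldq 11 / por
}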