https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287
Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|rtl-optimization            |target

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
clang can now produce:

        mov     eax, dword ptr [esp + 16]
        mov     ecx, dword ptr [esp + 28]
        vmovdqu xmm0, xmmword ptr [ecx + 32]
        vmovdqu xmm1, xmmword ptr [eax]
        vpackuswb       xmm2, xmm1, xmm0
        vpsubw  xmm0, xmm1, xmm0
        vpaddw  xmm0, xmm0, xmm2
        vpackuswb       xmm0, xmm0, xmm0
        vpackuswb       xmm0, xmm0, xmm0
        vpextrd eax, xmm0, 1
        ret

I suspect that if the back end were able to fold these builtins into plain
gimple at the gimple level, GCC would do a much better job. Currently we have
code like:

  _27 = __builtin_ia32_vextractf128_si256 (_28, 0);
  _26 = __builtin_ia32_vec_ext_v4si (_27, 1); [tail call]

I think both of these are really just BIT_FIELD_REFs, and the pair can be
simplified even further into a single bit-field extraction, rather than what
we emit now:

        vpackuswb       %ymm1, %ymm0, %ymm0
        vpextrd $1, %xmm0, %eax

Also, since __builtin_ia32_vextractf128_si256 (_28, 0) only reads the low
128 bits, clang is able to remove half of the code, as only 128-bit
operations are needed :).
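To make the BIT_FIELD_REF claim concrete, here is a small source-level
illustration (a sketch using the standard AVX/SSE4.1 intrinsics that map to
these builtins; the memcpy-based variant is only for demonstration, not the
proposed fix). Extracting lane 0 and then element 1 is the same as reading
bits [32, 64) of the 256-bit value:

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* What the gimple above effectively does today: two extraction builtins.  */
int32_t
extract_two_steps (__m256i v)
{
  __m128i lo = _mm256_extractf128_si256 (v, 0); /* __builtin_ia32_vextractf128_si256 */
  return _mm_extract_epi32 (lo, 1);             /* __builtin_ia32_vec_ext_v4si */
}

/* The single extraction this should fold to: bits [32, 64) of v,
   i.e. BIT_FIELD_REF <v, 32, 32> in gimple terms.  */
int32_t
extract_one_step (__m256i v)
{
  int32_t elems[8];
  memcpy (elems, &v, sizeof elems);
  return elems[1];
}

After folding, the two statements above could collapse to something like
_26 = BIT_FIELD_REF <_28, 32, 32>;, which expansion could then turn into a
single vpextrd from the low xmm register.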
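And a rough sketch of what such target-level folding could look like, in the
style of GCC's gimple-folding hooks (illustrative code against GCC's internal
APIs, not a tested patch; fold_vec_ext_v4si is a hypothetical helper name):

/* Hypothetical sketch: fold __builtin_ia32_vec_ext_v4si (vec, idx)
   into BIT_FIELD_REF <vec, 32, idx*32>.  Not actual GCC source.  */
static bool
fold_vec_ext_v4si (gimple_stmt_iterator *gsi)
{
  gimple *stmt = gsi_stmt (*gsi);
  tree lhs = gimple_call_lhs (stmt);
  tree vec = gimple_call_arg (stmt, 0);
  tree idx = gimple_call_arg (stmt, 1);

  /* Only fold when there is a result and the index is constant.  */
  if (!lhs || TREE_CODE (idx) != INTEGER_CST)
    return false;

  /* Each V4SI element is 32 bits wide; compute the bit position.  */
  unsigned HOST_WIDE_INT pos = tree_to_uhwi (idx) * 32;
  tree ref = build3 (BIT_FIELD_REF, TREE_TYPE (lhs), vec,
                     bitsize_int (32), bitsize_int (pos));

  /* Replace the builtin call with a plain gimple assignment so the
     rest of the middle end can see through it.  */
  gsi_replace (gsi, gimple_build_assign (lhs, ref), false);
  return true;
}

The vextractf128 case would presumably fold the same way with a 128-bit
width, and once both are BIT_FIELD_REFs, match.pd-style simplification
should be able to merge the nested extractions into one.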