https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287
--- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Andrew Pinski from comment #5)
> clang can now produce:
>         mov     eax, dword ptr [esp + 16]
>         mov     ecx, dword ptr [esp + 28]
>         vmovdqu xmm0, xmmword ptr [ecx + 32]
>         vmovdqu xmm1, xmmword ptr [eax]
>         vpackuswb       xmm2, xmm1, xmm0
>         vpsubw  xmm0, xmm1, xmm0
>         vpaddw  xmm0, xmm0, xmm2
>         vpackuswb       xmm0, xmm0, xmm0
>         vpackuswb       xmm0, xmm0, xmm0
>         vpextrd eax, xmm0, 1
>         ret
>
> I suspect that if the back-end were able to "fold" the builtins into plain
> gimple, GCC would do a much better job.
> Currently we have stuff like:
>   _27 = __builtin_ia32_vextractf128_si256 (_28, 0);
>   _26 = __builtin_ia32_vec_ext_v4si (_27, 1); [tail call]
>
> I think both are really just a BIT_FIELD_REF, and the pair can even be
> simplified to a single bitfield extraction, rather than what we do now:
>         vpackuswb       %ymm1, %ymm0, %ymm0
>         vpextrd $1, %xmm0, %eax
>
> Plus it looks like with __builtin_ia32_vextractf128_si256 (_28, 0), clang is
> able to remove half of the code due to only needing the low 128 bits :).

Yes, let me try this.
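Something along these lines in the target gimple folding hook should do it.
This is only a rough sketch of the shape, not a tested patch: the hook name
ix86_gimple_fold_builtin, the IX86_BUILTIN_VEC_EXT_V4SI code and the gimple
helpers are the real ones from i386.c, but the case body below is my
illustration of the idea:

/* Sketch: fold __builtin_ia32_vec_ext_v4si (v, i) into
   BIT_FIELD_REF <v, 32, 32*i> at the gimple level.  */

bool
ix86_gimple_fold_builtin (gimple_stmt_iterator *gsi)
{
  gimple *stmt = gsi_stmt (*gsi);
  tree fndecl = gimple_call_fndecl (stmt);
  if (!fndecl)
    return false;

  switch (DECL_MD_FUNCTION_CODE (fndecl))
    {
    case IX86_BUILTIN_VEC_EXT_V4SI:
      {
        tree lhs = gimple_call_lhs (stmt);
        tree vec = gimple_call_arg (stmt, 0);
        tree idx = gimple_call_arg (stmt, 1);
        if (!lhs || !tree_fits_uhwi_p (idx))
          return false;
        tree elt_type = TREE_TYPE (TREE_TYPE (vec));  /* int for V4SI.  */
        unsigned HOST_WIDE_INT elt_bits
          = tree_to_uhwi (TYPE_SIZE (elt_type));
        /* The insn only uses the low 2 bits of the index, mask like it.  */
        unsigned HOST_WIDE_INT pos = (tree_to_uhwi (idx) & 3) * elt_bits;
        tree ref = build3 (BIT_FIELD_REF, elt_type, vec,
                           TYPE_SIZE (elt_type), bitsize_int (pos));
        gimple *g = gimple_build_assign (lhs, ref);
        gimple_set_location (g, gimple_location (stmt));
        gsi_replace (gsi, g, false);
        return true;
      }
    /* IX86_BUILTIN_EXTRACTF128_SI256 would similarly become
       BIT_FIELD_REF <v, 128, 128*i> on the V8SI operand.  */
    default:
      break;
    }
  return false;
}

Once both calls are plain BIT_FIELD_REFs, forwprop/match.pd should be able to
combine the two extractions into the single one Andrew describes.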