https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|rtl-optimization            |target

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
clang can now produce:
        mov     eax, dword ptr [esp + 16]
        mov     ecx, dword ptr [esp + 28]
        vmovdqu xmm0, xmmword ptr [ecx + 32]
        vmovdqu xmm1, xmmword ptr [eax]
        vpackuswb       xmm2, xmm1, xmm0
        vpsubw  xmm0, xmm1, xmm0
        vpaddw  xmm0, xmm0, xmm2
        vpackuswb       xmm0, xmm0, xmm0
        vpackuswb       xmm0, xmm0, xmm0
        vpextrd eax, xmm0, 1
        ret

I suspect that if the back end were able to "fold" these builtins into plain
gimple at the gimple level, GCC would do a much better job.
Currently we have stuff like:
_27 = __builtin_ia32_vextractf128_si256 (_28, 0);
_26 = __builtin_ia32_vec_ext_v4si (_27, 1); [tail call]
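
For reference, a minimal sketch (not the PR's testcase; the function name is
made up) of intrinsics that expand to exactly this builtin pair:

        #include <immintrin.h>

        /* Illustrative only: _mm256_extractf128_si256 (v, 0) expands to
           __builtin_ia32_vextractf128_si256, and _mm_extract_epi32 (lo, 1)
           expands to __builtin_ia32_vec_ext_v4si.  */
        int
        extract_lane1 (__m256i v)
        {
          __m128i lo = _mm256_extractf128_si256 (v, 0); /* low 128-bit half */
          return _mm_extract_epi32 (lo, 1);             /* 32-bit lane 1 */
        }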

I think both are really just a BIT_FIELD_REF, and the pair can be simplified
even further to a single bitfield extraction rather than what we do now:
        vpackuswb       %ymm1, %ymm0, %ymm0
        vpextrd $1, %xmm0, %eax
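
A minimal sketch of that single-extraction form, using GNU C generic vector
subscripting (again illustrative, not the PR's testcase): a constant-index
element read of the low half is already gimplified to one BIT_FIELD_REF,
which is the form the two builtins above could be folded into:

        typedef int v8si __attribute__ ((vector_size (32)));

        /* v[1] is the same 32-bit lane the builtin pair extracts; GCC
           lowers this subscript to a single BIT_FIELD_REF in gimple.  */
        int
        extract_lane1_direct (v8si v)
        {
          return v[1];
        }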

Plus it looks like, given __builtin_ia32_vextractf128_si256 (_28, 0), clang is
able to remove half of the code since only the low 128 bits are needed :).
