https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Andrew Pinski from comment #5)
> clang can now produce:
>         mov     eax, dword ptr [esp + 16]
>         mov     ecx, dword ptr [esp + 28]
>         vmovdqu xmm0, xmmword ptr [ecx + 32]
>         vmovdqu xmm1, xmmword ptr [eax]
>         vpackuswb       xmm2, xmm1, xmm0
>         vpsubw  xmm0, xmm1, xmm0
>         vpaddw  xmm0, xmm0, xmm2
>         vpackuswb       xmm0, xmm0, xmm0
>         vpackuswb       xmm0, xmm0, xmm0
>         vpextrd eax, xmm0, 1
>         ret
> 
> I suspect if the back-end is able to "fold" the builtins into gimple at the
> gimple level, GCC will do a much better job.
> Currently we have stuff like:
> _27 = __builtin_ia32_vextractf128_si256 (_28, 0);
> _26 = __builtin_ia32_vec_ext_v4si (_27, 1); [tail call]
> 
> I think both are really just a BIT_FIELD_REF, and the pair can even be
> simplified to a single bit-field extraction rather than what we do now:
>         vpackuswb       %ymm1, %ymm0, %ymm0
>         vpextrd $1, %xmm0, %eax
> 
> Plus it looks like with __builtin_ia32_vextractf128_si256 (_28, 0), clang is
> able to remove half of the code since only the low 128 bits are needed :).

Yes, let me try this.
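
For __builtin_ia32_vec_ext_v4si, the fold could live in the x86 gimple
folding hook. Below is only a rough sketch against the usual GCC internal
APIs (ix86_gimple_fold_builtin, gsi_replace, BIT_FIELD_REF built via
build3); the exact builtin codes, index masking, and guard conditions
here are illustrative, not the committed change:

  /* Sketch only: fold __builtin_ia32_vec_ext_v4si (v, i) into
     BIT_FIELD_REF <v, 32, 32 * i> at the gimple level.  */
  static bool
  ix86_gimple_fold_builtin (gimple_stmt_iterator *gsi)
  {
    gimple *stmt = gsi_stmt (*gsi);
    tree fndecl = gimple_call_fndecl (stmt);
    if (!fndecl)
      return false;

    switch (DECL_MD_FUNCTION_CODE (fndecl))
      {
      case IX86_BUILTIN_VEC_EXT_V4SI:
        {
          tree lhs = gimple_call_lhs (stmt);
          tree vec = gimple_call_arg (stmt, 0);
          tree idx = gimple_call_arg (stmt, 1);
          if (!lhs || !tree_fits_uhwi_p (idx))
            break;
          /* Element i of a V4SI vector is the 32-bit field at
             bit position 32 * i (index taken modulo 4).  */
          unsigned HOST_WIDE_INT pos = (tree_to_uhwi (idx) & 3) * 32;
          tree ref = build3 (BIT_FIELD_REF, TREE_TYPE (lhs), vec,
                             bitsize_int (32), bitsize_int (pos));
          gsi_replace (gsi, gimple_build_assign (lhs, ref), false);
          return true;
        }
      default:
        break;
      }
    return false;
  }

__builtin_ia32_vextractf128_si256 could be handled the same way with a
128-bit BIT_FIELD_REF. Once both are folded, the generic gimple passes
should be able to combine the two BIT_FIELD_REFs into the single
extraction Andrew describes above.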
