https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63791
Marcus Kool <marcus.kool at urlfilterdb dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
            Summary|use 32-byte version of      |use 32-byte version of
                   |vpbroadcastb on AVX2        |vpbroadcastb (and register
                   |platform                    |to poulate) on AVX/AVX2
                   |                            |platforms
      Known to fail|                            |4.8.4, 4.9.2, 5.1.0
           Severity|minor                       |normal

--- Comment #2 from Marcus Kool <marcus.kool at urlfilterdb dot com> ---
After Jakub's comment I waited for the release of gcc 5.1.0, but the
performance of programs that use *_set1_epi8() got 6% worse: gcc 5.1.0 now
uses vpbroadcastb in the intended way, but it populates the ymm register from
slow memory instead of from a register.

This is what 5.1.0 generates:

        movl    %edi, -20(%rbp)
        vpbroadcastb    -20(%rbp), %ymm0

while this is optimal:

        vmovd   %edi, %xmm0
        vpbroadcastb    %xmm0, %ymm0

For the AVX platform (see attachment avx.c) gcc 5.1.0 also goes through memory
and uses many instructions to populate an xmm register:

        movl    %edi, -12(%rsp)
        vpxor   %xmm1, %xmm1, %xmm1
        vmovd   -12(%rsp), %xmm0
        xorl    %eax, %eax
        vpshufb %xmm1, %xmm0, %xmm0

where

        vmovd   %edi, %xmm0
        vpbroadcastb    %xmm0, %xmm0

is optimal.

To summarize: gcc 4.8.4 and gcc 4.9.2 produce code that can be optimised
further, and gcc 5.1.0 produces even slower code, which means that the
implementation of *_set1_epi8() is slower, in some cases much slower, than it
needs to be.
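
For readers without access to the attachments, the following is a minimal
sketch of the kind of function that exercises this code path. It is not the
attached test case: the helper names fill32 and fill32_avx2 are hypothetical,
and the runtime dispatch with a memset fallback is only there so the sketch
builds and runs on machines without AVX2. The relevant part is the single call
to _mm256_set1_epi8() on a value that arrives in an integer register.

```c
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_AVX2_PATH 1

/* Hypothetical helper. The target attribute lets this one function use
 * AVX2 even when the file is built without -mavx2. */
__attribute__((target("avx2")))
static void fill32_avx2(unsigned char out[32], char c)
{
    /* Broadcast the byte c to all 32 lanes of a ymm register. This is
     * the intrinsic whose code generation the comment above discusses. */
    __m256i v = _mm256_set1_epi8(c);

    /* Unaligned store, so out needs no particular alignment. */
    _mm256_storeu_si256((__m256i *)out, v);
}
#endif

/* Hypothetical wrapper: fill out[0..31] with the byte c, using the AVX2
 * path when the running CPU supports it and plain memset otherwise. */
static void fill32(unsigned char out[32], char c)
{
#ifdef HAVE_X86_AVX2_PATH
    if (__builtin_cpu_supports("avx2")) {
        fill32_avx2(out, c);
        return;
    }
#endif
    memset(out, (unsigned char)c, 32);
}
```

Calling fill32(buf, 'x') on a 32-byte buffer should leave every byte equal to
'x'. To look at the instructions chosen for the broadcast, one option is to
compile with -O2 -S and read the assembly emitted for fill32_avx2; the exact
sequence depends on the compiler version and flags.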