https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102392
Gabriel Ravier <gabravier at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|X86_64-linux-gnu            |x86_64-linux-gnu
            Version|12.0                        |15.0

--- Comment #5 from Gabriel Ravier <gabravier at gmail dot com> ---
I've wound up stumbling upon a very similar bug (which I think is the same bug
at its core) while examining the following code:

static uint32_t f(int8_t x)
{
    return (~(uint32_t)x) & 1;
}

void floop(uint32_t *r, int8_t *x, size_t n)
{
#ifndef __clang__
    _Pragma("GCC unroll 0")
    _Pragma("GCC novector")
#else
    _Pragma("clang loop unroll(disable) vectorize(disable)")
#endif
    for (size_t i = 0; i < n; ++i)
        r[i] = f(x[i]);
}

where for the loop, GCC generates:

.L3:
        movsx   eax, BYTE PTR [rsi+rdx]    # <--- sign extension
        not     eax
        and     eax, 1
        mov     DWORD PTR [rdi+rdx*4], eax
        add     rdx, 1
        cmp     rcx, rdx
        jne     .L3

whereas LLVM manages:

.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        movzx   ecx, byte ptr [rsi + rax]  # <--- zero extension
        not     ecx
        and     ecx, 1
        mov     dword ptr [rdi + 4*rax], ecx
        inc     rax
        cmp     rdx, rax
        jne     .LBB0_2

which makes LLVM's output slightly faster (according to llvm-mca) for the same
reasons (i.e. lack of conversion from sign extension to zero extension).
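
For reference, a minimal sketch of why the zero-extending load is a valid substitute here: the final `& 1` keeps only bit 0, which is the same whether the byte was sign-extended or zero-extended. The names `f_sext`, `f_zext` and `check_all_inputs` are made up for this illustration and appear in neither compiler's output; `f_zext` models the zero-extended load with an explicit `uint8_t` cast.

```c
#include <assert.h>
#include <stdint.h>

/* Same expression as f() in the report: int8_t -> uint32_t sign-extends. */
static uint32_t f_sext(int8_t x)
{
    return (~(uint32_t)x) & 1;
}

/* Hypothetical variant: go through uint8_t first, so the byte is
   zero-extended (the effect of a movzx load). */
static uint32_t f_zext(int8_t x)
{
    return (~(uint32_t)(uint8_t)x) & 1;
}

/* Exhaustively compare both variants over every int8_t value. */
static void check_all_inputs(void)
{
    for (int v = INT8_MIN; v <= INT8_MAX; ++v) {
        int8_t x = (int8_t)v;
        /* Only bit 0 survives the mask, and extension never changes bit 0. */
        assert(f_sext(x) == f_zext(x));
    }
}
```

Calling `check_all_inputs()` from a small `main` exercises all 256 inputs. This only illustrates that the two forms are equivalent at the source level; it says nothing about where in GCC's pipeline the transformation would need to happen.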