https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102392

Gabriel Ravier <gabravier at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|X86_64-linux-gnu            |x86_64-linux-gnu
            Version|12.0                        |15.0

--- Comment #5 from Gabriel Ravier <gabravier at gmail dot com> ---
I've wound up stumbling upon a very similar bug (which I think is the same bug
at its core) while examining the following code:

#include <stddef.h>
#include <stdint.h>

static uint32_t f(int8_t x)
{
    return (~(uint32_t)x) & 1;
}

void floop(uint32_t *r, int8_t *x, size_t n)
{
#ifndef __clang__
_Pragma("GCC unroll 0") _Pragma("GCC novector")
#else
_Pragma("clang loop unroll(disable) vectorize(disable)")
#endif
    for (size_t i = 0; i < n; ++i)
        r[i] = f(x[i]);
}

where for the loop, GCC generates:

.L3:
  movsx eax, BYTE PTR [rsi+rdx]        # <--- sign extension
  not eax
  and eax, 1
  mov DWORD PTR [rdi+rdx*4], eax
  add rdx, 1
  cmp rcx, rdx
  jne .L3

whereas LLVM manages:

.LBB0_2: # =>This Inner Loop Header: Depth=1
  movzx ecx, byte ptr [rsi + rax]       # <--- zero extension
  not ecx
  and ecx, 1
  mov dword ptr [rdi + 4*rax], ecx
  inc rax
  cmp rdx, rax
  jne .LBB0_2

which makes LLVM's output slightly faster (according to llvm-mca) for the same
reasons (i.e. GCC does not convert the sign extension into a zero extension).
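As a sanity check that the zero extension is a valid replacement here, a
minimal sketch follows. It assumes the only thing that matters is bit 0 of the
result, which comes from bit 0 of the loaded byte and is unaffected by how the
upper 24 bits are filled. The names f_sext, f_zext and count_mismatches are
made up for illustration and are not part of the testcase:

```c
#include <stdint.h>

/* The expression from the testcase: x is sign-extended to 32 bits. */
static uint32_t f_sext(int8_t x)
{
    return (~(uint32_t)x) & 1;
}

/* Hypothetical variant: going through uint8_t first makes the widening
   a zero extension, which is what the movzx in LLVM's output does. */
static uint32_t f_zext(int8_t x)
{
    return (~(uint32_t)(uint8_t)x) & 1;
}

/* Count the int8_t inputs for which the two variants disagree. */
static int count_mismatches(void)
{
    int mismatches = 0;
    for (int v = INT8_MIN; v <= INT8_MAX; ++v)
        if (f_sext((int8_t)v) != f_zext((int8_t)v))
            ++mismatches;
    return mismatches;
}
```

If count_mismatches() returns 0, the two forms agree on all 256 inputs, so
picking movzx over movsx would not change the observable result of the loop.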
