https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90991

            Bug ID: 90991
           Summary: _mm_loadu_ps intrinsic translates to vmovaps in
                    combination with _mm512_insertf32x4
           Product: gcc
           Version: 9.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kronbichler.martin at gmail dot com
  Target Milestone: ---

Created attachment 46515
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46515&action=edit
Reduced C/C++ file for reproducing the bug

For the following code:

#include <x86intrin.h>
__m512 f(const float       *in,
         const unsigned    *offsets)
{
  __m512 t0 = {};
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[0]), 0);
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[1]), 1);
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[2]), 2);
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[3]), 3);
  return t0;
}

when compiled on x86-64 with "g++ -march=skylake-avx512 -O1 -S", I get the
following assembly:

        movl    (%rsi), %eax
        vmovaps (%rdi,%rax,4), %xmm0
        movl    4(%rsi), %eax
        vinsertf32x4    $1, (%rdi,%rax,4), %zmm0, %zmm0
        movl    8(%rsi), %eax
        vinsertf32x4    $2, (%rdi,%rax,4), %zmm0, %zmm0
        movl    12(%rsi), %eax
        vinsertf32x4    $3, (%rdi,%rax,4), %zmm0, %zmm0
        ret

For the lowest 128 bits, handled in the second line of the assembly, gcc
wrongly emits the aligned load instruction vmovaps. This cannot be right: the
offsets array could be {0, 1, 2, 3}, for example, in which case 3 of the 4
loads are not 16-byte aligned no matter how "in" is aligned. Simple tests
indeed segfault; see the sketch below.
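A minimal harness of that kind (my own sketch; the buffer sizes and offset
values are illustrative, and it assumes f is compiled as shown, e.g. in a
separate translation unit):

#include <stdio.h>
#include <x86intrin.h>

__m512 f(const float *in, const unsigned *offsets); /* the function above */

int main(void)
{
  /* 16-byte-aligned buffer, so in + 1 is guaranteed misaligned */
  float in[20] __attribute__((aligned(16)));
  for (int i = 0; i < 20; ++i)
    in[i] = (float)i;
  /* only the lane-0 load is lowered to vmovaps, so offsets[0] = 1
     forces that load onto a non-16-byte-aligned address */
  const unsigned offsets[4] = {1, 5, 9, 13};
  float out[16];
  _mm512_storeu_ps(out, f(in, offsets)); /* faults here with the bad code */
  printf("%f %f\n", out[0], out[15]);
  return 0;
}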
The compiler hence ignores my unaligned load intrinsic "_mm_loadu_ps".
According to https://godbolt.org/z/j7T4Fy this wrong code was introduced with
gcc-9; compiling with gcc 8.3 gives the saner code

        movl    (%rsi), %eax
        vxorps  %xmm0, %xmm0, %xmm0
        vinsertf32x4    $0, (%rdi,%rax,4), %zmm0, %zmm0
        movl    4(%rsi), %eax
        vinsertf32x4    $1, (%rdi,%rax,4), %zmm0, %zmm0
        movl    8(%rsi), %eax
        vinsertf32x4    $2, (%rdi,%rax,4), %zmm0, %zmm0
        movl    12(%rsi), %eax
        vinsertf32x4    $3, (%rdi,%rax,4), %zmm0, %zmm0
        ret

I guess the compiler merges the lowest vinsertf32x4 instruction with the
vxorps instruction. At that stage it has already lost track of the original
unaligned load hidden behind the vinsertf32x4 instruction, which is allowed to
take a misaligned address, whereas vmovaps is not. The correct solution would
have been to emit "vmovups (%rdi,%rax,4), %xmm0".
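
The merge can be isolated with a further reduction (my own, not part of the
attachment); with gcc 9.1 and the same flags it should show the same single
vmovaps:

#include <x86intrin.h>

/* Inserting an unaligned 128-bit load into lane 0 of a zeroed vector:
   gcc 9 folds the vxorps + vinsertf32x4 pair into a single 128-bit
   load and, incorrectly, selects the aligned vmovaps for it. */
__m512 g(const float *in)
{
  __m512 t0 = {};
  return _mm512_insertf32x4(t0, _mm_loadu_ps(in), 0);
}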

godbolt also reports the error on trunk, so I assume it has not been fixed yet.
It also appears with "-march=knl".
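
Until this is fixed, one possible source-level workaround (my suggestion; I
have not verified it on all affected versions) is to form lane 0 with a
zero-masked unaligned 512-bit load instead of inserting into a zero vector,
which should keep the load unaligned:

#include <x86intrin.h>

__m512 f_workaround(const float *in, const unsigned *offsets)
{
  /* load 4 floats unaligned into lanes 0-3, zero lanes 4-15 */
  __m512 t0 = _mm512_maskz_loadu_ps(0x000F, in + offsets[0]);
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[1]), 1);
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[2]), 2);
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[3]), 3);
  return t0;
}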

Here are the details of my compiler:

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/lnm/gcc9/libexec/gcc/x86_64-redhat-linux/9.1.0/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../gcc-9.1.0/configure --prefix=/lnm/gcc9 --enable-bootstrap
--enable-shared --enable-threads=posix --enable-checking=release
--with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions
--enable-gnu-unique-object --enable-linker-build-id
--with-linker-hash-style=gnu --enable-languages=c,c++,fortran,lto
--disable-multilib --build=x86_64-redhat-linux
Thread model: posix
gcc version 9.1.0 (GCC) 

I also attach a reduced variant for compilation with gcc/g++ that does not
include x86intrin.h, to make the bug easier to reproduce.
