https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90991
Bug ID: 90991
Summary: _mm_loadu_ps intrinsic translates to vmovaps in combination with _mm512_insertf32x4
Product: gcc
Version: 9.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: kronbichler.martin at gmail dot com
Target Milestone: ---

Created attachment 46515
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46515&action=edit
Reduced C/C++ file for reproducing the bug

For the following code:

#include <x86intrin.h>

__m512 f(const float *in, const unsigned *offsets)
{
  __m512 t0 = {};
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[0]), 0);
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[1]), 1);
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[2]), 2);
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[3]), 3);
  return t0;
}

compiled on x86-64 with "g++ -march=skylake-avx512 -O1 -S", I get the following assembly:

        movl    (%rsi), %eax
        vmovaps (%rdi,%rax,4), %xmm0
        movl    4(%rsi), %eax
        vinsertf32x4    $1, (%rdi,%rax,4), %zmm0, %zmm0
        movl    8(%rsi), %eax
        vinsertf32x4    $2, (%rdi,%rax,4), %zmm0, %zmm0
        movl    12(%rsi), %eax
        vinsertf32x4    $3, (%rdi,%rax,4), %zmm0, %zmm0
        ret

For the lowest 128 bits, handled by the second line of the assembly, gcc wrongly emits the aligned load instruction vmovaps. This cannot be right: the offsets array could be {0, 1, 2, 3}, for example, in which case three of the four loads are not 16-byte aligned no matter the alignment of "in". And it indeed segfaults on simple tests (a standalone reproducer sketch is appended at the end of this report). The compiler hence ignores my unaligned load intrinsic _mm_loadu_ps.

According to https://godbolt.org/z/j7T4Fy this wrong code was introduced with gcc 9; compilation with gcc 8.3 gives the more sane code

        movl    (%rsi), %eax
        vxorps  %xmm0, %xmm0, %xmm0
        vinsertf32x4    $0, (%rdi,%rax,4), %zmm0, %zmm0
        movl    4(%rsi), %eax
        vinsertf32x4    $1, (%rdi,%rax,4), %zmm0, %zmm0
        movl    8(%rsi), %eax
        vinsertf32x4    $2, (%rdi,%rax,4), %zmm0, %zmm0
        movl    12(%rsi), %eax
        vinsertf32x4    $3, (%rdi,%rax,4), %zmm0, %zmm0
        ret

I guess the compiler merges the lowest vinsertf32x4 instruction with the vxorps instruction, since inserting into an all-zero register reduces to a plain 128-bit load into the low lane. At that stage it has already lost track of the original unaligned load hidden behind the vinsertf32x4 instruction, which is allowed to take a misaligned address, whereas vmovaps is not. The correct solution would have been to emit "vmovups (%rdi,%rax,4), %xmm0".

godbolt also reports the error on trunk, so I assume it has not been fixed yet. It also appears with "-march=knl".

Here are the details of my compiler:

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/lnm/gcc9/libexec/gcc/x86_64-redhat-linux/9.1.0/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../gcc-9.1.0/configure --prefix=/lnm/gcc9 --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,fortran,lto --disable-multilib --build=x86_64-redhat-linux
Thread model: posix
gcc version 9.1.0 (GCC)

I also attach a reduced variant that compiles with gcc/g++ without including x86intrin.h, for more easily reproducing the bug.
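
For convenience, here is a minimal sketch of the kind of simple test that segfaults for me. The offsets {1, 2, 3, 4} and the buffer contents are my choice: offsets[0] = 1 forces the first 128-bit load to a 4-byte-aligned (but not 16-byte-aligned) address, which the wrongly emitted vmovaps rejects with #GP (observed as SIGSEGV). The noinline attribute is only there to keep f compiled out of line, matching the assembly shown above:

#include <stdio.h>
#include <x86intrin.h>

/* Same function as above; noinline keeps the standalone code path
   (including the bad vmovaps) instead of letting -O1 inline it. */
__attribute__((noinline))
__m512 f(const float *in, const unsigned *offsets)
{
  __m512 t0 = {};
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[0]), 0);
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[1]), 1);
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[2]), 2);
  t0 = _mm512_insertf32x4(t0, _mm_loadu_ps(in + offsets[3]), 3);
  return t0;
}

int main(void)
{
  /* 64-byte aligned backing store; in + offsets[0] == in + 1 is then
     4-byte aligned only, so an aligned 16-byte load from it faults. */
  static float in[32] __attribute__((aligned(64)));
  const unsigned offsets[4] = {1, 2, 3, 4};
  for (int i = 0; i < 32; ++i)
    in[i] = (float)i;

  float out[16];
  _mm512_storeu_ps(out, f(in, offsets)); /* crashes with the bad codegen */
  printf("%g\n", out[0]);                /* should print 1 */
  return 0;
}

Built with "g++ -march=skylake-avx512 -O1", this crashes for me inside f, while with the gcc 8.3 code (or with vmovups) it prints 1 as expected.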