https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754
--- Comment #6 from Peter Cordes <peter at cordes dot ca> ---
Looks good to me, thanks for taking care of this quickly. Hopefully we can get this backported to the GCC11 series to limit the damage for people using these newish intrinsics. I'd love to recommend them for general use, except for this GCC problem where some distros have already shipped GCC versions that compile without error but in a 100% broken way.

Portable ways to do narrow alignment/aliasing-safe SIMD loads were sorely lacking; there aren't good effective workarounds for this, especially for 16-bit loads. (I still don't know how to portably / safely write code that will compile to a memory-source PMOVZXBQ across all compilers; Intel's intrinsics API is rather lacking in some areas and relies on compilers folding loads into memory source operands.)

> So, isn't that a bug in the intrinsic guide instead?

Yes, __m128i _mm_loadu_si16 only really makes sense with SSE2 for PINSRW. Even movzx into an integer reg and then MOVD xmm, eax requires SSE2. With only SSE1 you'd have to movzx / dword store to stack / MOVSS reload.

SSE1 makes *some* sense for _mm_loadu_si32 since it can be implemented with a single MOVSS if MOVD isn't available. But we already have SSE1 __m128 _mm_load_ss(const float *) for that. Except GCC's implementation of _mm_load_ss isn't alignment and strict-aliasing safe; it derefs the actual float *__P as _mm_set_ss (*__P). Which I think is a bug, although I'm not clear what semantics Intel intended for that intrinsic. Clang implements it as alignment/aliasing safe with a packed may_alias struct containing a float. MSVC always behaves like -fno-strict-aliasing, and I *think* ICC does, too.

Perhaps best to follow the crowd and make all narrow load/store intrinsics alignment and aliasing safe, unless that causes code-gen regressions; users can _mm_set_ss( *ptr ) themselves if they want to tell the compiler that it's a normal C float object.
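
For illustration, here is a minimal sketch of what alignment- and aliasing-safe narrow loads can look like in GNU C. It is not GCC's or Clang's actual header code. The names load_u16_safe, load_ss_safe and struct unaligned_float are hypothetical, and the attribute syntax assumes a GCC-compatible compiler. Whether a given compiler folds these into a single PINSRW / MOVD / MOVSS memory-source instruction is not guaranteed.

```c
#include <emmintrin.h>  /* SSE2; also pulls in xmmintrin.h for _mm_set_ss */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 16-bit load from any address, zero-extended into
   the low element of a vector. memcpy makes no alignment or aliasing
   assumptions about p. */
static inline __m128i load_u16_safe(const void *p)
{
    uint16_t tmp;
    memcpy(&tmp, p, sizeof tmp);
    return _mm_cvtsi32_si128(tmp);  /* upper bits of the vector are zeroed */
}

/* Hypothetical type in the style of a packed may_alias struct holding a
   float (GNU C extension): packed drops the alignment requirement,
   may_alias exempts accesses from strict-aliasing assumptions. */
struct unaligned_float { float v; } __attribute__((packed, may_alias));

/* Hypothetical helper: scalar float load that is safe for any pointer. */
static inline __m128 load_ss_safe(const void *p)
{
    return _mm_set_ss(((const struct unaligned_float *)p)->v);
}
```

A caller who knows the pointer really refers to a normal, aligned C float object can still write _mm_set_ss(*ptr) directly.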

I was going to report this _mm_load_ss behaviour, but PR84508 is still open and already covers the relevant ss and sd intrinsics. That PR points out that Intel specifically documents it as not requiring alignment, without mentioning aliasing.

----

Speaking of bouncing through a GP-integer reg, GCC unfortunately does that; it seems to incorrectly think PINSRW xmm, mem, 0 requires -msse4.1, unlike with a GP register source. Reported as PR105066 along with related missed optimizations about folding into a memory source operand for pmovzx/sx.

But that's unrelated to correctness; this bug can be closed unless we're keeping it open until it's fixed in the GCC11 current stable series.