https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68924
Bug ID: 68924
Summary: No intrinsic for x86 `MOVQ m64, %xmm` in 32bit mode.
Product: gcc
Version: 5.3.0
URL: http://stackoverflow.com/questions/34279513/loading-8-chars-from-memory-into-an-m256-variable-as-packed-single-precision-f
Status: UNCONFIRMED
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: i386-linux-gnu

Context and background:
http://stackoverflow.com/questions/34279513/loading-8-chars-from-memory-into-an-m256-variable-as-packed-single-precision-f
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68923

gcc and clang don't provide the _mm_cvtsi64_si128 intrinsic for movq in
32-bit mode at all (ICC does, see below).  They do still provide
__m128i _mm_move_epi64(__m128i a), but at -O0 the load of the source
__m128i won't fold into the movq, so you get an unwanted 128-bit load
that could cross into an unmapped page and segfault.

The lack of this, and the lack of an intrinsic for PMOVZX as a load from
a narrower source, is a design flaw in the intrinsics, IMO.  It's absurd
to be forced to use an intrinsic for an instruction you don't even want
(movq), quite apart from the portability problem it causes for 32-bit
x86.

Consider trying to get gcc to emit `VPMOVZXBD (%src), %ymm0` in 32-bit
mode:

#include <immintrin.h>
#include <stdint.h>

__m256 load_bytes_to_m256(uint8_t *p)
{
    __m128i small_load = _mm_cvtsi64_si128( *(uint64_t*)p );
    __m256i intvec = _mm256_cvtepu8_epi32( small_load );
    return _mm256_cvtepi32_ps(intvec);
}

That's the same code as in the other bug report (about the failure to
fold the load into a memory source operand for vpmovzx:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68923), but with the
#ifdefs taken out.

--------

_mm_cvtsi64_si128 is the intrinsic for the MOVQ %r/m64, %xmm form of
MOVQ (the MOVD/MOVQ entry in Intel's manual).  Its non-VEX encoding
requires a REX prefix, and even the VEX encoding is illegal in 32-bit
mode (probably because the decoder can't tell whether the instruction is
legal until it has checked the mod/rm byte to see whether the source is
a 64-bit register or a 64-bit memory location).  Since the other MOVQ
gives identical results and has a shorter non-VEX encoding, there's no
reason to bother with that complexity.

The other MOVQ (the one Intel's insn ref lists under just MOVQ), which
can be used for %mm,%mm reg moves or for the low half of %xmm,%xmm
regs, only has a __m128i-to-__m128i intrinsic,
__m128i _mm_move_epi64(__m128i a).  (Its m64 load form does exist as
_mm_loadl_epi64, but that takes a __m128i const* rather than a pointer
to a 64-bit type; the pmovz/sx intrinsics don't have a narrow-load form
at all.)

------------

Other than this design flaw in the intrinsics, you could see it as only
a bug in gcc/clang's implementation, since Intel's own implementation
does still make it possible to get MOVQ m64, %xmm emitted in 32-bit
mode.  ICC13 still provides _mm_cvtsi64_si128 in 32-bit mode, and will
use the MOVQ xmm, m64 form as a load.  If it has a uint64_t in two
32-bit registers, it emulates the movq with 2x MOVD %r32, %xmm and a
PUNPCKLDQ: http://goo.gl/LQkVJL

Two 32-bit stores followed by a movq reload would cause a
store-forwarding failure stall.  vmovd/vpinsrd would be fewer
instructions, but pinsrd is a 2-uop instruction on Intel SnB-family
CPUs, so in uop terms the two sequences are equal: 3 uops for the
shuffle port (port 5).

At -O0, ICC emulates it that way even when the value is in memory, with
2x MOVD m32, %xmm and a PUNPCK, so even Intel's compiler "thinks of" the
intrinsic as normally being the MOVQ %r/m64, %xmm form, not the
MOVQ %xmm/m64, %xmm form.
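
--------

For reference, that 2x MOVD + PUNPCKLDQ emulation can be written
directly with intrinsics.  A minimal sketch (the helper name
emulate_cvtsi64_si128 is made up for illustration, and it uses the same
loose pointer-casting style as the example above):

#include <emmintrin.h>
#include <stdint.h>

/* Emulate MOVQ m64, %xmm with two MOVD loads merged by a PUNPCKLDQ,
   like what ICC13 generates at -O0 in 32-bit mode. */
static __m128i emulate_cvtsi64_si128(const uint64_t *p)
{
    const uint32_t *p32 = (const uint32_t*)p;
    __m128i lo = _mm_cvtsi32_si128( (int)p32[0] );  /* movd: bits [31:0]  */
    __m128i hi = _mm_cvtsi32_si128( (int)p32[1] );  /* movd: bits [63:32] */
    return _mm_unpacklo_epi32( lo, hi );            /* punpckldq          */
}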
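
A portable workaround does exist today: _mm_loadl_epi64 (SSE2) compiles
to the MOVQ xmm, m64 load in both 32-bit and 64-bit mode.  A sketch of
the example above rewritten that way (the _portable suffix is just for
illustration; whether the load then folds into vpmovzxbd's memory
operand is the separate issue in bug 68923):

#include <immintrin.h>
#include <stdint.h>

/* Same as load_bytes_to_m256, but using the movq load intrinsic so it
   compiles for i386 as well as x86-64.  Requires AVX2. */
__m256 load_bytes_to_m256_portable(uint8_t *p)
{
    __m128i small_load = _mm_loadl_epi64( (const __m128i*)p );
    __m256i intvec = _mm256_cvtepu8_epi32( small_load );
    return _mm256_cvtepi32_ps(intvec);
}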