https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68924

            Bug ID: 68924
           Summary: No intrinsic for x86  `MOVQ m64, %xmm`  in 32bit mode.
           Product: gcc
           Version: 5.3.0
               URL: http://stackoverflow.com/questions/34279513/loading-8-
                    chars-from-memory-into-an-m256-variable-as-packed-sing
                    le-precision-f
            Status: UNCONFIRMED
          Keywords: missed-optimization, ssemmx
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: i386-linux-gnu

context and background:
http://stackoverflow.com/questions/34279513/loading-8-chars-from-memory-into-an-m256-variable-as-packed-single-precision-f

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68923


gcc and clang don't even provide the _mm_cvtsi64_si128 intrinsic for movq in
32bit mode (ICC does, see below).  They still provide  __m128i
_mm_move_epi64(__m128i a), but at -O0 the load of the source __m128i won't fold
into the movq, so you'd get an undesired 128b load that could cross a page
boundary and segfault.
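
A minimal sketch of that workaround (the function name is mine, for
illustration; the intrinsics are real):

#include <immintrin.h>
#include <stdint.h>
/* What gcc/clang leave us with in 32bit mode. */
__m128i load_low64_sketch(const uint8_t *p)
{
    /* Architecturally a full 16-byte load: unless the compiler folds it
       into a movq (it won't at -O0), reading the last 8 mapped bytes
       before an unmapped page segfaults. */
    __m128i wide = _mm_loadu_si128( (const __m128i*)p );
    return _mm_move_epi64(wide);   /* keep the low 64b, zero the rest */
}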


The lack of this, and the lack of an intrinsic for PMOVZX as a load from a
narrower source, is a design flaw in the intrinsics, IMO.  It's super dumb to
be forced to use an intrinsic for an instruction I don't want (movq), and
that would be true even if it didn't also create a portability problem for
32bit x86.


Consider trying to get gcc to emit `VPMOVZXBD  (%src), %ymm0` for 32bit mode:

#include <immintrin.h>
#include <stdint.h>
__m256 load_bytes_to_m256(uint8_t *p)
{
    /* gcc/clang reject this line in 32bit mode: they don't provide
       _mm_cvtsi64_si128 there at all. */
    __m128i small_load = _mm_cvtsi64_si128( *(uint64_t*)p );
    __m256i intvec = _mm256_cvtepu8_epi32( small_load );  /* VPMOVZXBD */
    return _mm256_cvtepi32_ps(intvec);                    /* VCVTDQ2PS */
}

That's the same code as in the other bug report (about the failure to fold
the load into a memory source operand for vpmovzx:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68923), but with the #ifdefs
taken out.
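
With the intrinsic available and the load folded (bug 68923), the ideal asm
for 32bit mode would be something like this (hand-written to illustrate the
goal, not current compiler output):

load_bytes_to_m256:
        mov        4(%esp), %eax      # pointer arg on the stack (32bit ABI)
        vpmovzxbd  (%eax), %ymm0      # 8 bytes -> 8 zero-extended dwords
        vcvtdq2ps  %ymm0, %ymm0       # packed int32 -> packed float
        ret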


--------


_mm_cvtsi64_si128 is the intrinsic for the MOVQ %r/m64, %xmm form of MOVQ
(the MOVD/MOVQ entry in Intel's manual).  Its non-VEX encoding requires a
REX.W prefix, and even the VEX encoding of it is illegal in 32bit mode
(probably because the decoder couldn't tell whether the instruction was legal
until it checked the mod/rm byte to see whether it encoded a 64b register
source or a 64b memory location).  Since the other MOVQ gives identical
results and has a shorter non-VEX encoding, there's no reason to bother with
that complexity.
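
For reference, the two load-capable forms side by side (byte sequences
hand-assembled from the opcode maps; register/addressing choices arbitrary):

        f3 0f 7e 04 24     movq   (%esp), %xmm0   # MOVQ xmm, xmm/m64: fine in 32bit mode
        66 48 0f 6e c0     movq   %rax, %xmm0     # MOVD/MOVQ with REX.W: 64bit mode only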

The other MOVQ (the one Intel's insn ref lists under just MOVQ), which can be
used for %mm,%mm reg moves or for the low half of %xmm,%xmm regs, only has an
__m128i-to-__m128i intrinsic:  __m128i _mm_move_epi64(__m128i a), not a load
form (the same problem as the pmovzx/pmovsx intrinsics).



------------

Other than this design flaw in the intrinsics, you could see it as only a bug
in gcc/clang's implementation, since Intel's own implementation does still
make it possible to get MOVQ m64, %xmm emitted in 32bit mode.


ICC13 still provides _mm_cvtsi64_si128 in 32bit mode, and will use the MOVQ
xmm, m64 form as a load.  If it has a uint64_t in two 32bit registers, it
emulates it with 2x MOVD %r32, %xmm and a PUNPCKLDQ.  http://goo.gl/LQkVJL.
(Two 32b stores followed by a 64b movq reload would cause a store-forwarding
failure stall: a wide load can't be forwarded from two narrower stores.)
vmovd/vpinsrd would be fewer instructions, but pinsrd is a 2-uop instruction
on Intel SnB-family CPUs, so in uop count the two sequences are equal: 3 uops
for the shuffle port (port5).
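
Hand-reconstructed from that description (register choices arbitrary), the
two 3-uop sequences are:

        vmovd      %eax, %xmm0               # low half
        vmovd      %edx, %xmm1               # high half
        vpunpckldq %xmm1, %xmm0, %xmm0       # low qword of xmm0 = edx:eax

versus:

        vmovd      %eax, %xmm0
        vpinsrd    $1, %edx, %xmm0, %xmm0    # insert high half as dword 1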

At -O0, ICC emulates it that way even if the value is in memory, with 2x MOVD
m32, %xmm and a PUNPCK, so even Intel's compiler "thinks of" the intrinsic as
normally being the MOVQ %r/m64, %xmm form, not the MOVQ %xmm/m64, %xmm form.
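
With the uint64_t at (%eax), that -O0 output looks something like this
(hand-reconstructed, not copied from ICC):

        vmovd      (%eax), %xmm0
        vmovd      4(%eax), %xmm1
        vpunpckldq %xmm1, %xmm0, %xmm0

instead of the single  vmovq (%eax), %xmm0  that the MOVQ xmm, m64 form
allows.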
