[Bug target/91117] New: _mm_movpi64_epi64/_mm_movepi64_pi64 generating store+load instead of using MOVQ2DQ/MOVDQ2Q

2019-07-08 Thread wolfwings+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91117

Bug ID: 91117
   Summary: _mm_movpi64_epi64/_mm_movepi64_pi64 generating
store+load instead of using MOVQ2DQ/MOVDQ2Q
   Product: gcc
   Version: 9.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wolfwings+gcc at gmail dot com
  Target Milestone: ---

_mm_movpi64_epi64 never uses MOVQ2DQ (and _mm_movepi64_pi64 never uses
MOVDQ2Q) despite documentation stating that it should in mixed MMX -> SSE
situations, and that these are in fact the intrinsics to use when the
Q2DQ/DQ2Q opcodes are desired.

This appears to be because the header definitions fall back to a memory write
followed by a read, except in (technically invalid) SSE -> SSE cases where a
MOVD is used.

Tested on GCC 7.4 and 9.1 locally, with additional testing on Godbolt showing
identical code generated all the way back to the 4.x series.

Compiled with -O1:

#include <immintrin.h>

__m128i test( __m128i input ) {
    __m64 x = _mm_movepi64_pi64( input );
    return _mm_movpi64_epi64( _mm_mullo_pi16( x, x ) );
}

Generated assembly on GCC 9.1:

movq    %xmm0, -16(%rsp)
movq    -16(%rsp), %mm0
movq    %mm0, %mm1
pmullw  %mm0, %mm1
movq    %mm1, -16(%rsp)
movq    -16(%rsp), %xmm0
ret

A version that issues movq2dq/movdq2q explicitly via inline asm works and
produces the expected assembly sequence:

#include <immintrin.h>

static inline __m64 _my_movepi64_pi64( __m128i input ) {
    __m64 result;
    asm( "movdq2q %1, %0" : "=y" (result) : "x" (input) : );
    return result;
}

static inline __m128i _my_movpi64_epi64( __m64 input ) {
    __m128i result;
    asm( "movq2dq %1, %0" : "=x" (result) : "y" (input) : );
    return result;
}

__m128i test( __m128i input ) {
    __m64 x = _my_movepi64_pi64( input );
    return _my_movpi64_epi64( _mm_mullo_pi16( x, x ) );
}

Generated assembly on GCC 7.4, 9.1, and others via Godbolt, again with -O1 (-O2
and -O3 make no difference):

movdq2q %xmm0, %mm0
pmullw  %mm0, %mm0
movq2dq %mm0, %xmm0
ret

For completeness, ICC generates the 'short' code form on all available versions
without needing the inline assembly workaround.

[Bug target/91117] _mm_movpi64_epi64/_mm_movepi64_pi64 generating store+load instead of using MOVQ2DQ/MOVDQ2Q

2019-07-09 Thread wolfwings+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91117

--- Comment #2 from Wolf ---
Is there any way to still access MMX instructions/registers with intrinsics in
gcc-10 at all, or is this bug "fixable, but that code-generation option will go
away entirely with gcc-10 so it wouldn't matter" then?

The full code block in which I found this hiccup was running out of SSE
registers on the pre-AVX Silvermont platform it runs on (so no 3-operand VEX
encoding is available), so I moved some early operations to MMX registers
instead.

That relieved the register pressure entirely, allowing the constants to stay
in registers for the full processing loop and avoiding almost all
spills/stalls, which is how I found this hiccup in the first place.
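
As a concrete illustration of that pattern (hypothetical names and constants,
not the actual production loop), the early per-lane work runs in MMX registers
so the XMM registers can keep the loop constants resident:

#include <immintrin.h>

/* Hypothetical sketch: squares two 64-bit halves in MMX, then combines
   them in SSE where the constants k0/k1 stay resident in XMM registers.
   A real loop would also call _mm_empty() after the last MMX use before
   any x87 code runs. */
__m128i process_block( __m64 lo, __m64 hi, __m128i k0, __m128i k1 ) {
    __m64 sq_lo = _mm_mullo_pi16( lo, lo );    /* early work in %mm registers */
    __m64 sq_hi = _mm_mullo_pi16( hi, hi );

    /* cross into SSE only when the wide constants are needed */
    __m128i v = _mm_unpacklo_epi64( _mm_movpi64_epi64( sq_lo ),
                                    _mm_movpi64_epi64( sq_hi ) );
    v = _mm_add_epi16( v, k0 );
    return _mm_mullo_epi16( v, k1 );
}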