https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91117
Bug ID: 91117
Summary: _mm_movpi64_epi64/_mm_movepi64_pi64 generating store+load instead of using MOVQ2DQ/MOVDQ2Q
Product: gcc
Version: 9.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: wolfwings+gcc at gmail dot com
Target Milestone: ---

_mm_movpi64_epi64 never uses MOVQ2DQ (and _mm_movepi64_pi64 never uses MOVDQ2Q), despite documentation stating that it should in mixed MMX -> SSE situations, and that these are in fact the intrinsics to use when the Q2DQ/DQ2Q opcodes are desired. The cause appears to be the header definitions of these intrinsics, which fall back to a memory write followed by a read, except in (technically invalid) SSE -> SSE cases where a MOVD is used instead.

Tested locally on GCC 7.4 and 9.1, with additional testing on Godbolt showing identical code generated all the way back to the 4.x series.

Compiled with -O1:

#include <emmintrin.h>

__m128i test( __m128i input )
{
    __m64 x = _mm_movepi64_pi64( input );
    return _mm_movpi64_epi64( _mm_mullo_pi16( x, x ) );
}

Generated assembly on GCC 9.1:

        movq    %xmm0, -16(%rsp)
        movq    -16(%rsp), %mm0
        movq    %mm0, %mm1
        pmullw  %mm0, %mm1
        movq    %mm1, -16(%rsp)
        movq    -16(%rsp), %xmm0
        ret

A version that makes explicit calls to movq2dq/movdq2q works and produces the expected assembly sequence:

#include <emmintrin.h>

static inline __m64 _my_movepi64_pi64( __m128i input )
{
    __m64 result;
    asm( "movdq2q %1, %0" : "=y" (result) : "x" (input) );
    return result;
}

static inline __m128i _my_movpi64_epi64( __m64 input )
{
    __m128i result;
    asm( "movq2dq %1, %0" : "=x" (result) : "y" (input) );
    return result;
}

__m128i test( __m128i input )
{
    __m64 x = _my_movepi64_pi64( input );
    return _my_movpi64_epi64( _mm_mullo_pi16( x, x ) );
}

Generated assembly on GCC 7.4, 9.1, and others via Godbolt, again with -O1 (-O2 and -O3 make no difference):

        movdq2q %xmm0, %mm0
        pmullw  %mm0, %mm0
        movq2dq %mm0, %xmm0
        ret

For completeness, ICC generates the 'short' code form on all available versions without needing the inline assembly workaround.