https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91117
Bug ID: 91117
Summary: _mm_movpi64_epi64/_mm_movepi64_pi64 generating store+load instead of using MOVQ2DQ/MOVDQ2Q
Product: gcc
Version: 9.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: wolfwings+gcc at gmail dot com
Target Milestone: ---

_mm_movpi64_epi64 never uses MOVQ2DQ (and _mm_movepi64_pi64 never uses MOVDQ2Q), despite documentation stating that it should in mixed MMX -> SSE situations, and that these are in fact the intrinsics to use when the Q2DQ/DQ2Q opcodes are desired. The cause appears to be the header definitions of these intrinsics, which fall back to a memory write followed by a read, except in (technically invalid) SSE -> SSE cases where a MOVD is used instead.

Tested locally on GCC 7.4 and 9.1, with additional testing on Godbolt showing identical code generated all the way back to the 4.x series.

Compiled with -O1:

#include <emmintrin.h>

__m128i test( __m128i input )
{
    __m64 x = _mm_movepi64_pi64( input );
    return _mm_movpi64_epi64( _mm_mullo_pi16( x, x ) );
}

Generated assembly on GCC 9.1:

        movq    %xmm0, -16(%rsp)
        movq    -16(%rsp), %mm0
        movq    %mm0, %mm1
        pmullw  %mm0, %mm1
        movq    %mm1, -16(%rsp)
        movq    -16(%rsp), %xmm0
        ret

A version that makes explicit calls to movq2dq/movdq2q works and produces the expected assembly sequence:

#include <emmintrin.h>

static inline __m64 _my_movepi64_pi64( __m128i input )
{
    __m64 result;
    asm( "movdq2q %1, %0" : "=y" (result) : "x" (input) );
    return result;
}

static inline __m128i _my_movpi64_epi64( __m64 input )
{
    __m128i result;
    asm( "movq2dq %1, %0" : "=x" (result) : "y" (input) );
    return result;
}

__m128i test( __m128i input )
{
    __m64 x = _my_movepi64_pi64( input );
    return _my_movpi64_epi64( _mm_mullo_pi16( x, x ) );
}

Generated assembly on GCC 7.4, 9.1, and others via Godbolt, again with -O1 (-O2 and -O3 make no difference):

        movdq2q %xmm0, %mm0
        pmullw  %mm0, %mm0
        movq2dq %mm0, %xmm0
        ret

For completeness, ICC generates the 'short' code form on all available versions without needing the inline assembly workaround.