http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57193

             Bug #: 57193
           Summary: suboptimal register allocation for SSE registers
    Classification: Unclassified
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: vermaelen.wou...@gmail.com

This bug _might_ be related to PR56339, although that report describes a
regression compared to 4.7, while this bug seems to be a regression compared
to 4.4.

I was converting some hand-written asm code to SSE intrinsics, but
unfortunately the version using intrinsics generates worse code: it contains
two unnecessary 'movdqa' instructions.

I managed to reduce my test to this routine:

//--------------------------------------------------------------
#include <emmintrin.h>

void test1(const __m128i* in1, const __m128i* in2, __m128i* out,
           __m128i f, __m128i zero)
{
    __m128i c = _mm_avg_epu8(*in1, *in2);    // rounded average of unsigned bytes
    __m128i l = _mm_unpacklo_epi8(c, zero);  // zero-extend low 8 bytes to u16
    __m128i h = _mm_unpackhi_epi8(c, zero);  // zero-extend high 8 bytes to u16
    __m128i m = _mm_mulhi_epu16(l, f);       // high 16 bits of u16*u16 product
    __m128i n = _mm_mulhi_epu16(h, f);
    *out = _mm_packus_epi16(m, n);           // saturate back down to 16 bytes
}
//--------------------------------------------------------------
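For completeness, here is a minimal driver for the routine above. It is only
a sketch of how test1 might be called; the input values and the expected
result are my own illustration and not part of the reduced testcase. Append
it to the file above and build with SSE2 enabled:

//--------------------------------------------------------------
#include <stdio.h>

int main(void)
{
    __m128i a    = _mm_set1_epi8(100);             // sixteen bytes of 100
    __m128i b    = _mm_set1_epi8(50);              // sixteen bytes of 50
    __m128i f    = _mm_set1_epi16((short)0x8000);  // 0.5 in 0.16 fixed point
    __m128i zero = _mm_setzero_si128();
    __m128i out;

    test1(&a, &b, &out, f, zero);

    unsigned char r[16];
    _mm_storeu_si128((__m128i*)r, out);
    // pavgb rounds: (100 + 50 + 1) >> 1 = 75; pmulhuw: (75 * 0x8000) >> 16 = 37
    printf("%u\n", r[0]);                          // prints 37
    return 0;
}
//--------------------------------------------------------------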



A (few days old) gcc snapshot generates the following code. Versions 4.5, 4.6
and 4.7 generate similar code:

   0:   66 0f 6f 17             movdqa (%rdi),%xmm2
   4:   66 0f e0 16             pavgb  (%rsi),%xmm2
   8:   66 0f 6f da             movdqa %xmm2,%xmm3
   c:   66 0f 68 d1             punpckhbw %xmm1,%xmm2
  10:   66 0f 60 d9             punpcklbw %xmm1,%xmm3
  14:   66 0f e4 d0             pmulhuw %xmm0,%xmm2
  18:   66 0f 6f cb             movdqa %xmm3,%xmm1
  1c:   66 0f e4 c8             pmulhuw %xmm0,%xmm1
  20:   66 0f 6f c1             movdqa %xmm1,%xmm0
  24:   66 0f 67 c2             packuswb %xmm2,%xmm0
  28:   66 0f 7f 02             movdqa %xmm0,(%rdx)
  2c:   c3                      retq

Gcc versions 4.3 and 4.4 (and clang) generate the following, apparently
optimal, code. Here each result simply overwrites a register whose value is
dead (m replaces l in %xmm3, and n replaces f in %xmm0, exploiting the
commutativity of pmulhuw), so the two extra copies disappear:

   0:   66 0f 6f 17             movdqa (%rdi),%xmm2
   4:   66 0f e0 16             pavgb  (%rsi),%xmm2
   8:   66 0f 6f da             movdqa %xmm2,%xmm3
   c:   66 0f 68 d1             punpckhbw %xmm1,%xmm2
  10:   66 0f 60 d9             punpcklbw %xmm1,%xmm3
  14:   66 0f e4 d8             pmulhuw %xmm0,%xmm3
  18:   66 0f e4 c2             pmulhuw %xmm2,%xmm0
  1c:   66 0f 67 d8             packuswb %xmm0,%xmm3
  20:   66 0f 7f 1a             movdqa %xmm3,(%rdx)
  24:   c3                      retq
