https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91546

            Bug ID: 91546
           Summary: Better solution for VEC_INIT under TARGET_SSE4_1 since
                    PINSRB/PINSRD/PINSRQ
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: crazylht at gmail dot com
  Target Milestone: ---
            Target: i386, x86-64

For the following testcase:

#include <immintrin.h>
__m128
test2 (int a, int b, int c, int d)
{
  return __extension__ (__m128) (__v4si) {a, b, c, d};
}

Compiled with -Ofast -march=skylake-avx512,
GCC generates:

test2(int, int, int, int):
        vmovd   xmm2, edx
        vmovd   xmm3, edi
        vpinsrd xmm1, xmm2, ecx, 1
        vpinsrd xmm0, xmm3, esi, 1
        vpunpcklqdq     xmm0, xmm0, xmm1
        ret

while clang generates:

test2(int, int, int, int):
        vmovd   xmm0, edi
        vpinsrd xmm0, xmm0, esi, 1
        vpinsrd xmm0, xmm0, edx, 2
        vpinsrd xmm0, xmm0, ecx, 3
        ret

That is one instruction fewer for V4SI, and the saving grows for V8SI/V16SI.
The same applies to V8QI/V2DI.
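
To make the two shapes concrete, here is a minimal sketch using GNU C vector extensions. It only models the lane movement of each sequence (two half-built vectors merged at the end, versus one vector filled lane by lane) and does not force the compiler to emit any particular instructions. The names v4si_sketch, build_by_halves, build_by_chain and same_lanes are hypothetical and not part of GCC.

```c
#include <string.h>

/* Four 32-bit lanes in a 128-bit vector (GNU C extension, accepted by
   GCC and clang).  Hypothetical name, chosen for this sketch only.  */
typedef int v4si_sketch __attribute__ ((vector_size (16)));

/* Shape of the current GCC sequence: fill the low two lanes of two
   separate vectors, then merge their low 64-bit halves
   (vmovd + vpinsrd twice, followed by vpunpcklqdq).  */
static v4si_sketch
build_by_halves (int a, int b, int c, int d)
{
  v4si_sketch lo = { a, b, 0, 0 };
  v4si_sketch hi = { c, d, 0, 0 };
  v4si_sketch r = { lo[0], lo[1], hi[0], hi[1] };
  return r;
}

/* Shape of the clang sequence: start from lane 0 and insert the
   remaining lanes one at a time into the same vector
   (vmovd followed by three vpinsrd).  */
static v4si_sketch
build_by_chain (int a, int b, int c, int d)
{
  v4si_sketch r = { a, 0, 0, 0 };
  r[1] = b;
  r[2] = c;
  r[3] = d;
  return r;
}

/* Returns 1 when both vectors hold identical lanes, 0 otherwise.  */
static int
same_lanes (v4si_sketch x, v4si_sketch y)
{
  return memcmp (&x, &y, sizeof x) == 0;
}
```

Both helpers produce the same vector; the difference in the listings above is only in how many instructions and temporary registers the construction takes.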

It would also make the cost of vec_construct in vectorization more realistic:

      case vec_construct:
        {
          /* N element inserts into SSE vectors.  */
          int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);
          /* One vinserti64x4 and two vinserti128 for combining SSE
             and AVX256 vectors to AVX512.  */
          else if (GET_MODE_BITSIZE (mode) == 512)
            cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
          return cost;
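
As a rough illustration of why the chained-insert sequence matches this formula better, the computation can be modeled standalone. This is only a sketch: struct toy_costs and toy_vec_construct_cost are hypothetical names, the cost values passed in are placeholders rather than real tuning numbers, and ix86_vec_cost is assumed here to return its cost argument unchanged, which may not hold for every tuning.

```c
/* Hypothetical stand-in for the two cost-table fields used by the
   vec_construct case; not GCC's actual processor cost structure.  */
struct toy_costs
{
  int sse_op;   /* cost of one SSE operation, e.g. one element insert */
  int addss;    /* cost used for each cross-lane combining step */
};

/* Models the formula quoted above: one sse_op per element, plus one
   combining step for 256-bit vectors or three for 512-bit vectors.
   Assumption: the ix86_vec_cost scaling is treated as identity.  */
static int
toy_vec_construct_cost (const struct toy_costs *c, int nelts, int bits)
{
  int cost = nelts * c->sse_op;
  if (bits == 256)
    cost += c->addss;
  else if (bits == 512)
    cost += 3 * c->addss;
  return cost;
}
```

With both placeholder costs set to 1, a V4SI construct is costed at 4, which corresponds to the four-instruction vmovd/vpinsrd chain rather than the five-instruction sequence GCC currently emits.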
