https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89114
Bug ID: 89114
Summary: rtx_cost of VEC_SELECT, VEC_CONCAT and VEC_DUPLICATE with memory operands is wrong
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rguenth at gcc dot gnu.org
Target Milestone: ---

Split out from PR89049. On its testcase, combine is willing to elide an
unnecessary %ymm build-up, but the target's RTX cost makes that not
profitable. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049#c5

So with (the bogus)

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c      (revision 268383)
+++ gcc/config/i386/i386.c      (working copy)
@@ -40848,7 +40848,7 @@ ix86_rtx_costs (rtx x, machine_mode mode
         recognizable.  In which case they all pretty much have the
         same cost.  */
       *total = cost->sse_op;
-      return true;
+      return false;
     case VEC_MERGE:
       mask = XEXP (x, 2);
       /* This is masked instruction, assume the same cost,

we get combine to do

Trying 11 -> 25:
   11: r105:V8SF=vec_concat(r106:V4SF,[r85:DI+0x10])
   25: r111:V4SF=vec_select(r105:V8SF,parallel)
      REG_DEAD r105:V8SF
Successfully matched this instruction:
(set (reg:V4SF 111)
    (mem:V4SF (plus:DI (reg:DI 85 [ ivtmp.11 ])
            (const_int 16 [0x10])) [1 MEM[base: _2, offset: 0B]+16 S16 A32]))
allowing combination of insns 11 and 25
original costs 16 + 12 = 28
replacement cost 12

and we elide the %ymm build:

.L2:
        vmovups (%rdi), %xmm1
        addq    $32, %rdi
        vaddss  %xmm1, %xmm0, %xmm0
        vshufps $85, %xmm1, %xmm1, %xmm2
        vaddss  %xmm2, %xmm0, %xmm0
        vunpckhps       %xmm1, %xmm1, %xmm2
        vshufps $255, %xmm1, %xmm1, %xmm1
        vaddss  %xmm2, %xmm0, %xmm0
        vaddss  %xmm1, %xmm0, %xmm0
        vmovups -16(%rdi), %xmm1
        vshufps $85, %xmm1, %xmm1, %xmm2
        vaddss  %xmm1, %xmm0, %xmm0
        vaddss  %xmm2, %xmm0, %xmm0
        vunpckhps       %xmm1, %xmm1, %xmm2
        vshufps $255, %xmm1, %xmm1, %xmm1
        vaddss  %xmm2, %xmm0, %xmm0
        vaddss  %xmm1, %xmm0, %xmm0
        cmpq    %rdi, %rax
        jne     .L2

The patch is bogus because the intention of not scanning sub-rtxen was to
match the various shuffle patterns, which do something like
(vec_select (vec_concat ..) ...). I am not sure if there is a helper in
i386.c to extract/cost a single MEM sub-rtx, but the course of action would
be to do this properly somehow.
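
To make the trade-off concrete, here is a rough, self-contained toy model of the two costing strategies. It is not GCC code: the Rtx struct, the Costs table with its values, and the functions flat_cost and mem_aware_cost are all hypothetical names made up for illustration. It only assumes that a nested shuffle such as (vec_select (vec_concat ..) ...) should still cost one sse_op, while each MEM operand reached through it should add a load cost.

```cpp
#include <vector>

// Toy stand-ins for RTL codes; NOT GCC's real rtx representation.
enum class Code { REG, MEM, VEC_SELECT, VEC_CONCAT, VEC_DUPLICATE };

struct Rtx {
  Code code;
  std::vector<Rtx> ops;  // operands; empty for REG and MEM leaves
};

// Hypothetical cost table; names and values are illustrative assumptions.
struct Costs {
  int sse_op;
  int sse_load;
};

bool is_vec_shuffle(Code c) {
  return c == Code::VEC_SELECT || c == Code::VEC_CONCAT ||
         c == Code::VEC_DUPLICATE;
}

// Sum the load cost of every MEM reachable through nested shuffle codes,
// without charging the nested shuffles themselves again.
int mem_operand_cost(const Rtx &x, const Costs &c) {
  if (x.code == Code::MEM)
    return c.sse_load;
  int total = 0;
  if (is_vec_shuffle(x.code))
    for (const Rtx &op : x.ops)
      total += mem_operand_cost(op, c);
  return total;
}

// Flat model: a shuffle costs one sse_op and its operands are ignored,
// analogous to setting *total and returning true.
int flat_cost(const Rtx &x, const Costs &c) {
  if (is_vec_shuffle(x.code))
    return c.sse_op;
  return x.code == Code::MEM ? c.sse_load : 0;
}

// MEM-aware model (an assumption about one possible fix): one sse_op for
// the whole shuffle pattern plus a load cost per MEM operand.
int mem_aware_cost(const Rtx &x, const Costs &c) {
  if (is_vec_shuffle(x.code))
    return c.sse_op + mem_operand_cost(x, c);
  return x.code == Code::MEM ? c.sse_load : 0;
}
```

In this model a (vec_concat reg mem) costs more under mem_aware_cost than under flat_cost, while a register-only (vec_select (vec_concat reg reg)) costs the same under both, which is the property the shuffle patterns rely on.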