https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117562
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---

  34.37%  419183  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] utt_decode_block.constprop.0
  17.28%  212804  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] utt_decode_block.constprop.0

It's the inlined copy of vector_gautbl_eval_logs3 where we now seem to hit
vectorized code for

  for (i = 0; i < veclen; i++) {
    diff1 = x[i] - m1[i];
    dval1 -= diff1 * diff1 * v1[i];
    diff2 = x[i] - m2[i];
    dval2 -= diff2 * diff2 * v2[i];
  }

never for the zmm version, few cycles on the ymm version, and now very many
on the xmm version:

     772 │ vmovaps      -0xc0(%rbp),%xmm6
       4 │ vmulpd       %xmm13,%xmm13,%xmm13
         │ vmulpd       %xmm1,%xmm1,%xmm1
  102664 │ vfnmadd132pd %xmm14,%xmm3,%xmm13
       1 │ vcvtps2pd    %xmm6,%xmm3
         │ diff2 = x[i] - m2[i];
       3 │ vmovaps      -0xd0(%rbp),%xmm6
         │ dval1 -= diff1 * diff1 * v1[i];
   95472 │ vfnmadd132pd %xmm1,%xmm13,%xmm3

We end up also applying basic-block vectorization to the scalar loop via the
stores after it:

  if (dval1 < gautbl->distfloor) dval1 = gautbl->distfloor;
  if (dval2 < gautbl->distfloor) dval2 = gautbl->distfloor;
  score[r] = (int32)(f * dval1);
  score[r+1] = (int32)(f * dval2);

Interestingly, without LTO this specific instance doesn't behave that badly
but is faster with the three epilogues:

  12.63%  132184  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] vector_gautbl_eval_logs3
  11.55%  119447  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] vector_gautbl_eval_logs3

In turn a similar case shows up elsewhere:

  27.63%  285537  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] mgau_eval
  15.19%  158646  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] mgau_eval

That's basically exactly the same loop kernel.
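For reference, the kernel can be isolated as a standalone C sketch (the
function name and the float32/float64 typedefs are my own stand-ins, not the
exact sphinx3 source; diff1/diff2 are kept in double to match the
vcvtps2pd-then-vmulpd sequence in the annotation above):

```c
typedef float  float32;
typedef double float64;

/* Two-score Gaussian density reduction as in the inlined copy of
   vector_gautbl_eval_logs3: float inputs are widened to double, squared,
   and accumulated with FNMA into two independent double reductions.  */
static void
eval_kernel (int veclen, const float32 *x,
             const float32 *m1, const float32 *v1,
             const float32 *m2, const float32 *v2,
             float64 *dval1p, float64 *dval2p)
{
  float64 dval1 = *dval1p, dval2 = *dval2p;
  float64 diff1, diff2;
  int i;

  for (i = 0; i < veclen; i++)
    {
      diff1 = x[i] - m1[i];
      dval1 -= diff1 * diff1 * v1[i];   /* widened to V2DF lanes  */
      diff2 = x[i] - m2[i];
      dval2 -= diff2 * diff2 * v2[i];
    }

  *dval1p = dval1;
  *dval2p = dval2;
}
```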
What we can see here is

     130 │ vmovhps      %xmm4,0x60(%rsp)
       5 │ vmovhps      %xmm2,0x70(%rsp)
         │ vmovaps      0x70(%rsp),%xmm0
   10552 │ vcvtps2pd    %xmm4,%xmm6
    1204 │ vcvtps2pd    %xmm2,%xmm3
         │ vcvtps2pd    %xmm0,%xmm2
    6872 │ vmovaps      0x60(%rsp),%xmm0
    1598 │ vmulpd       %xmm3,%xmm3,%xmm3
       5 │ vmulpd       %xmm2,%xmm2,%xmm2
   15119 │ vfnmadd132pd %xmm6,%xmm1,%xmm3
      38 │ vcvtps2pd    %xmm0,%xmm1
    5088 │ vfnmadd132pd %xmm2,%xmm3,%xmm1
   33818 │ vunpckhpd    %xmm1,%xmm1,%xmm2
   39938 │ vaddpd       %xmm1,%xmm2,%xmm2
   37932 │ test         $0x3,%cl

There is an odd spill/reload at 0x60(%rsp) and I can't see where it's coming
from.  Could it be that V2SFmode is spilled as V2SFmode but reloaded as
V4SFmode?!

#(insn:TI 1161 1175 1578 15 (set (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
#        (vec_select:V4SF (vec_concat:V8SF (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                        (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
#                (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566]))
#            (parallel [
#                    (const_int 6 [0x6])
#                    (const_int 7 [0x7])
#                    (const_int 2 [0x2])
#                    (const_int 3 [0x3])
#                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
#     (expr_list:REG_DEAD (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566])
#        (nil)))
        vmovhps %xmm2, 112(%rsp)        # 1161  [c=4 l=8]  sse_movhlps/4
#(insn 1578 1161 1173 15 (set (reg:V4SF 20 xmm0 [853])
#        (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128]))
#        "cont_mgau.c":157:9 2354 {movv4sf_internal}
#     (nil))
        vmovaps 112(%rsp), %xmm0        # 1578  [c=10 l=8]  movv4sf_internal/3

Huh.  It looks like this is from a V4SF -> 2xV2DF extension via
vec_unpack_{hi,lo}_expr.
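A minimal intrinsics sketch (my own illustration, not code from the testcase)
of what this V4SF -> 2x V2DF widening boils down to: the lo half converts
directly, while the hi half must first be moved down with movhlps before
cvtps2pd — and it is that movhlps whose memory alternative produces the
spill/reload seen in the annotation.

```c
#include <immintrin.h>

/* vec_unpack_lo: convert the low two floats in place.
   vec_unpack_hi: movhlps the high two floats down, then convert.
   On x86 there is no cvtps2pd form that reads the high half directly,
   which is why the high-half shuffle (sse_movhlps) gets emitted.  */
static void
widen_v4sf (__m128 v, __m128d *lo, __m128d *hi)
{
  *lo = _mm_cvtps_pd (v);                     /* lanes 0,1 -> double */
  *hi = _mm_cvtps_pd (_mm_movehl_ps (v, v));  /* lanes 2,3 -> double */
}
```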
Originally this is

(insn 1161 1160 1162 58 (set (reg:V4SF 853)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 853)
                (reg:V4SF 566 [ vect__811.229 ]))
            (parallel [
                    (const_int 6 [0x6])
                    (const_int 7 [0x7])
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
     (expr_list:REG_DEAD (reg:V4SF 566 [ vect__811.229 ])
        (nil)))

but LRA chooses the memory alternative for the destination:

    Choosing alt 4 in insn 1161:  (0) m  (1) 0  (2) v {sse_movhlps}
  Considering alt=0 of insn 1162:  (0) =v  (1) v
            overall=6,losers=1,rld_nregs=1
    Choosing alt 0 in insn 1162:  (0) =v  (1) v {sse2_cvtps2pd}
      Creating newreg=998 from oldreg=853, assigning class ALL_SSE_REGS to r998
 1162: r568:V2DF=float_extend(vec_select(r998:V4SF,parallel))
    Inserting insn reload before:
 1578: r998:V4SF=r853:V4SF

This is the following pattern:

(define_insn "sse_movhlps"
  [(set (match_operand:V4SF 0 "nonimmediate_operand"     "=x,v,x,v,m")
        (vec_select:V4SF
          (vec_concat:V8SF
            (match_operand:V4SF 1 "nonimmediate_operand" " 0,v,0,v,0")
            (match_operand:V4SF 2 "nonimmediate_operand" " x,v,o,o,v"))
          (parallel [(const_int 6)
                     (const_int 7)
                     (const_int 2)
                     (const_int 3)])))]
  "TARGET_SSE && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
  "@
   movhlps\t{%2, %0|%0, %2}
   vmovhlps\t{%2, %1, %0|%0, %1, %2}
   movlps\t{%H2, %0|%0, %H2}
   vmovlps\t{%H2, %1, %0|%0, %1, %H2}
   %vmovhps\t{%2, %0|%q0, %2}"
  [(set_attr "isa" "noavx,avx,noavx,avx,*")
   (set_attr "type" "ssemov2")
   (set_attr "prefix" "orig,maybe_evex,orig,maybe_evex,maybe_vex")
   (set_attr "mode" "V4SF,V4SF,V2SF,V2SF,V2SF")])

Indeed, the "mode" attr says V2SF for the memory (store) alternative, but
this is for a V4SFmode access.  Also, LRA doesn't seem to understand that
match_operand 1 should be the same memory.
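For reference, a scalar C model (a hypothetical helper, purely for
illustration) of what the pattern's vec_select/vec_concat computes, and why
the store alternative depends on operands[0] and operands[1] being the same
location:

```c
/* Scalar model of sse_movhlps:
     dest = vec_select (vec_concat (op1, op2), {6, 7, 2, 3})
   i.e. dest low half  = high half of op2 (concat elements 6,7),
        dest high half = high half of op1 (concat elements 2,3).
   The memory alternative (vmovhps to m) only writes the low 64 bits of
   the destination, so it is correct only under the "0" matching
   constraint: op1 must be the very same memory, leaving the high half
   already in place.  */
static void
movhlps_model (float dest[4], const float op1[4], const float op2[4])
{
  const float tmp[4] = { op2[2], op2[3], op1[2], op1[3] };
  for (int i = 0; i < 4; i++)
    dest[i] = tmp[i];
}
```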
(insn 1161 1160 1578 59 (set (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
        (vec_select:V4SF (vec_concat:V8SF (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
                        (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
                (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566]))
            (parallel [
                    (const_int 6 [0x6])
                    (const_int 7 [0x7])
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
     (nil))

It's also odd that I don't see the spill to
(mem/c:V4SF (plus:DI (reg/f:DI 7 sp) (const_int 112 [0x70])))
that LRA would need to generate for the input operand.  Clearly something is
odd here, and clearly this alternative is bad.