https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154
--- Comment #25 from rguenther at suse dot de <rguenther at suse dot de> --- On Sat, 17 Aug 2019, ubizjak at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154 > > --- Comment #24 from Uroš Bizjak <ubizjak at gmail dot com> --- > It looks that the patch introduced a (small?) runtime regression of 5% in > SPEC2000 300.twolf on haswell [1]. Maybe worth looking at. Biggest changes when benchmarking -mno-stv (base) against -mstv (peak): 7.28% 37789 twolf_peak.none twolf_peak.none [.] ucxx2 4.21% 25709 twolf_base.none twolf_base.none [.] ucxx2 3.72% 22584 twolf_base.none twolf_base.none [.] new_dbox 2.48% 22364 twolf_peak.none twolf_peak.none [.] new_dbox 1.49% 8270 twolf_base.none twolf_base.none [.] sub_penal 1.12% 7576 twolf_peak.none twolf_peak.none [.] sub_penal 1.36% 9314 twolf_peak.none twolf_peak.none [.] old_assgnto_new2 1.11% 5257 twolf_base.none twolf_base.none [.] old_assgnto_new2 and in ucxx2 I see 0.17 │ mov %eax,(%rsp) 3.55 │ vpmins (%rsp),%xmm0,%xmm1 │ test %eax,%eax 0.22 │ vmovd %xmm1,%r8d 0.80 │ cmovs %esi,%r8d This is from code like a1LoBin = Trybin/binWidth < 0 ? 0 : (Trybin>numBins ? numBins : Trybin) with only the inner one recognized as MIN because 'numBins' is only ever loaded conditionally and we don't speculate it. So we expand from _41 = _40 / binWidth.15_36; if (_41 >= 0) goto <bb 5>; [59.00%] else goto <bb 6>; [41.00%] bb5: numBins.26_42 = numBins; iftmp.19_315 = MIN_EXPR <_41, numBins.26_42>; bb6: # iftmp.19_267 = PHI <iftmp.19_315(5), 0(4)> ending up with movl %r9d, %eax cltd idivl %ecx movl %eax, (%rsp) vpminsd (%rsp), %xmm0, %xmm1 testl %eax, %eax vmovd %xmm1, %r11d cmovs %esi, %r11d and STV converting single-instruction 'chains': Collected chain #40... insns: 381 defs to convert: r463, r465 Computing gain for chain #40... Instruction gain 8 for 381: {r465:SI=smin(r463:SI,[`numBins']);clobber flags:CC;} REG_DEAD r463:SI REG_UNUSED flags:CC Instruction conversion gain: 8 Registers conversion cost: 4 Total gain: 4 Converting chain #40... to me the "spill" to (%rsp) looks suspicious and even more so the vector(!) memory use in vpminsd. RA could have used movd %eax, %xmm1 vpminsd %xmm1, %xmm0, %xmm1 no? IRA allocates the pseudo to memory. Testcase: extern int numBins; extern int binOffst; extern int binWidth; extern int Trybin; void foo (int); void bar (int aleft, int axcenter) { int a1LoBin = (((Trybin=((axcenter + aleft)-binOffst)/binWidth)<0) ? 0 : ((Trybin>numBins) ? numBins : Trybin)); foo (a1LoBin); } STV had emitted (insn 10 9 38 2 (parallel [ (set (reg:SI 93) (div:SI (reg:SI 92) ... (insn 38 10 12 2 (set (subreg:V4SI (reg:SI 98) 0) (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 93)) (const_vector:V4SI [ (const_int 0 [0]) repeated x4 ]) (const_int 1 [0x1]))) "t.c":9:56 -1 (nil)) ... (insn 39 31 32 2 (set (reg:SI 99) (mem/c:SI (symbol_ref:DI ("numBins") [flags 0x40] <var_decl 0x7f9a4bfdbab0 numBins>) [1 numBins+0 S4 A32])) "t.c":9:75 -1 (nil)) (insn 32 39 37 2 (set (subreg:V4SI (reg:SI 95) 0) (smin:V4SI (subreg:V4SI (reg:SI 98) 0) (subreg:V4SI (reg:SI 99) 0))) "t.c":9:75 3657 {*sse4_1_sminv4si3} (nil)) but then combine elimiated the copy... Trying 38 -> 32: 38: r98:SI#0=vec_merge(vec_duplicate(r93:SI),const_vector,0x1) 32: r95:SI#0=smin(r98:SI#0,r99:SI#0) REG_DEAD r99:SI REG_DEAD r98:SI Successfully matched this instruction: (set (subreg:V4SI (reg:SI 95) 0) (smin:V4SI (subreg:V4SI (reg:SI 93) 0) (subreg:V4SI (reg:SI 99 [ numBins ]) 0))) allowing combination of insns 38 and 32 original costs 4 + 40 = 44 replacement cost 40 ...running into the issue I tried to fix with making the live-range split copy more "explicit" ... So it looks like STV "forgot" to convert 'reg:SI 99' which is a memory load it split out. But even when fixing that combine now forwards both ops, eliminating the LR split again :/ Disabling combine results in the expected variant (not going through the stack): idivl binWidth(%rip) vmovd %eax, %xmm0 testl %eax, %eax movl %eax, Trybin(%rip) vpminsd %xmm1, %xmm0, %xmm0 vmovd %xmm0, %eax cmovns %eax, %edi jmp foo so while Index: gcc/config/i386/i386-features.c =================================================================== --- gcc/config/i386/i386-features.c (revision 274666) +++ gcc/config/i386/i386-features.c (working copy) @@ -910,7 +910,9 @@ general_scalar_chain::convert_op (rtx *o { rtx tmp = gen_reg_rtx (GET_MODE (*op)); - emit_insn_before (gen_move_insn (tmp, *op), insn); + emit_insn_before (gen_rtx_SET (gen_rtx_SUBREG (vmode, tmp, 0), + gen_gpr_to_xmm_move_src (vmode, *op)), + insn); *op = gen_rtx_SUBREG (vmode, tmp, 0); if (dump_file) will help to fence off regprop it doesn't help against combines power :/ (or IRAs failure to split live-ranges)