https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154

--- Comment #25 from rguenther at suse dot de <rguenther at suse dot de> ---
On Sat, 17 Aug 2019, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154
> 
> --- Comment #24 from Uroš Bizjak <ubizjak at gmail dot com> ---
> It looks that the patch introduced a (small?) runtime regression of 5% in
> SPEC2000 300.twolf on haswell [1]. Maybe worth looking at.

Biggest changes when benchmarking -mno-stv (base) against -mstv (peak):

   7.28%         37789  twolf_peak.none  twolf_peak.none   [.] ucxx2 
   4.21%         25709  twolf_base.none  twolf_base.none   [.] ucxx2        
   3.72%         22584  twolf_base.none  twolf_base.none   [.] new_dbox
   2.48%         22364  twolf_peak.none  twolf_peak.none   [.] new_dbox
   1.49%          8270  twolf_base.none  twolf_base.none   [.] sub_penal
   1.12%          7576  twolf_peak.none  twolf_peak.none   [.] sub_penal
   1.36%          9314  twolf_peak.none  twolf_peak.none   [.] old_assgnto_new2
   1.11%          5257  twolf_base.none  twolf_base.none   [.] old_assgnto_new2

and in ucxx2 I see

  0.17 │       mov    %eax,(%rsp)
  3.55 │       vpmins (%rsp),%xmm0,%xmm1   
       │       test   %eax,%eax
  0.22 │       vmovd  %xmm1,%r8d              
  0.80 │       cmovs  %esi,%r8d

This is from code like

  a1LoBin = Trybin/binWidth < 0 ? 0 : (Trybin>numBins ? numBins : Trybin)

with only the inner one recognized as MIN because 'numBins' is only
ever loaded conditionally and we don't speculate it.  So we expand
from

  _41 = _40 / binWidth.15_36;
  if (_41 >= 0)
    goto <bb 5>; [59.00%]
  else
    goto <bb 6>; [41.00%]

bb5:
  numBins.26_42 = numBins;
  iftmp.19_315 = MIN_EXPR <_41, numBins.26_42>;

bb6:
  # iftmp.19_267 = PHI <iftmp.19_315(5), 0(4)>

ending up with

        movl    %r9d, %eax
        cltd
        idivl   %ecx
        movl    %eax, (%rsp)
        vpminsd (%rsp), %xmm0, %xmm1
        testl   %eax, %eax
        vmovd   %xmm1, %r11d
        cmovs   %esi, %r11d

and STV converting single-instruction 'chains':

Collected chain #40... 
  insns: 381
  defs to convert: r463, r465
Computing gain for chain #40...
  Instruction gain 8 for   381: {r465:SI=smin(r463:SI,[`numBins']);clobber 
flags:CC;}
      REG_DEAD r463:SI
      REG_UNUSED flags:CC
  Instruction conversion gain: 8 
  Registers conversion cost: 4
  Total gain: 4
Converting chain #40...

to me the "spill" to (%rsp) looks suspicious and even more so
the vector(!) memory use in vpminsd.  RA could have used

  movd  %eax, %xmm1
  vpminsd %xmm1, %xmm0, %xmm1

no?  IRA allocates the pseudo to memory.  Testcase:

extern int numBins;
extern int binOffst;
extern int binWidth;
extern int Trybin;
void foo (int);

void bar (int aleft, int axcenter)
{
  int a1LoBin = (((Trybin=((axcenter + aleft)-binOffst)/binWidth)<0)
                 ? 0 : ((Trybin>numBins) ? numBins : Trybin));
  foo (a1LoBin);
}

STV had emitted

(insn 10 9 38 2 (parallel [
            (set (reg:SI 93)
                (div:SI (reg:SI 92)
...
(insn 38 10 12 2 (set (subreg:V4SI (reg:SI 98) 0)
        (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 93))
            (const_vector:V4SI [
                    (const_int 0 [0]) repeated x4
                ])
            (const_int 1 [0x1]))) "t.c":9:56 -1
     (nil))
...
(insn 39 31 32 2 (set (reg:SI 99)
        (mem/c:SI (symbol_ref:DI ("numBins") [flags 0x40]  <var_decl 
0x7f9a4bfdbab0 numBins>) [1 numBins+0 S4 A32])) "t.c":9:75 -1
     (nil))
(insn 32 39 37 2 (set (subreg:V4SI (reg:SI 95) 0)
        (smin:V4SI (subreg:V4SI (reg:SI 98) 0)
            (subreg:V4SI (reg:SI 99) 0))) "t.c":9:75 3657 
{*sse4_1_sminv4si3}
     (nil))

but then combine elimiated the copy...

Trying 38 -> 32:
   38: r98:SI#0=vec_merge(vec_duplicate(r93:SI),const_vector,0x1)
   32: r95:SI#0=smin(r98:SI#0,r99:SI#0)
      REG_DEAD r99:SI
      REG_DEAD r98:SI
Successfully matched this instruction:
(set (subreg:V4SI (reg:SI 95) 0)
    (smin:V4SI (subreg:V4SI (reg:SI 93) 0)
        (subreg:V4SI (reg:SI 99 [ numBins ]) 0)))
allowing combination of insns 38 and 32
original costs 4 + 40 = 44
replacement cost 40

...running into the issue I tried to fix with making the live-range
split copy more "explicit" ...

So it looks like STV "forgot" to convert 'reg:SI 99' which is
a memory load it split out.  But even when fixing that combine
now forwards both ops, eliminating the LR split again :/

Disabling combine results in the expected variant (not going
through the stack):

        idivl   binWidth(%rip)
        vmovd   %eax, %xmm0
        testl   %eax, %eax
        movl    %eax, Trybin(%rip)
        vpminsd %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, %eax
        cmovns  %eax, %edi
        jmp     foo

so while

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c     (revision 274666)
+++ gcc/config/i386/i386-features.c     (working copy)
@@ -910,7 +910,9 @@ general_scalar_chain::convert_op (rtx *o
     {
       rtx tmp = gen_reg_rtx (GET_MODE (*op));

-      emit_insn_before (gen_move_insn (tmp, *op), insn);
+      emit_insn_before (gen_rtx_SET (gen_rtx_SUBREG (vmode, tmp, 0),
+                                    gen_gpr_to_xmm_move_src (vmode, 
*op)),
+                       insn);
       *op = gen_rtx_SUBREG (vmode, tmp, 0);

       if (dump_file)

will help to fence off regprop it doesn't help against combines
power :/  (or IRAs failure to split live-ranges)

Reply via email to