[Bug tree-optimization/52272] [4.7 regression] Performance regression of 410.bwaves on x86.

rguenth at gcc dot gnu.org Thu, 16 Feb 2012 04:59:04 -0800

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52272


--- Comment #4 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-02-16 
12:40:06 UTC ---
Before the patch we chose

Improved to:
  cost: 128 (complexity 0)
  cand_cost: 19
  cand_use_cost: 28 (complexity 0)
  candidates: 2, 4, 7
   use:0 --> iv_cand:4, cost=(2,0)
   use:1 --> iv_cand:4, cost=(2,0)
   use:2 --> iv_cand:2, cost=(0,0)
   use:3 --> iv_cand:7, cost=(0,0)
   use:4 --> iv_cand:7, cost=(4,0)
   use:5 --> iv_cand:7, cost=(4,0)
   use:6 --> iv_cand:7, cost=(4,0)
   use:7 --> iv_cand:7, cost=(4,0)
   use:8 --> iv_cand:7, cost=(4,0)
   use:9 --> iv_cand:7, cost=(4,0)

and now we do not consider for example candidate 7 for use 4:

candidate 7
  var_before ivtmp.190
  var_after ivtmp.190
  incremented before exit test
  type character(kind=4)
  base (character(kind=4)) (a_296(D) + (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8)
  step 8
  base object (void *) a_296(D)

use 4
  generic
  in statement D.2322_387 = axp_318(D) + D.2321_367;

  at position
  type real(kind=8)[0:D.1963] * restrict
  base axp_318(D) + (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
  base object (void *) axp_318(D)
  related candidates

and we really do not want to do that because of the wrong-code issue.
We instead end up with

Improved to:
  cost: 133 (complexity 7)
  cand_cost: 13
  cand_use_cost: 39 (complexity 7)
  candidates: 4, 5
   use:0 --> iv_cand:4, cost=(2,0)
   use:1 --> iv_cand:4, cost=(2,0)
   use:2 --> iv_cand:5, cost=(0,0)
   use:3 --> iv_cand:5, cost=(5,1)
   use:4 --> iv_cand:5, cost=(5,1)
   use:5 --> iv_cand:5, cost=(5,1)
   use:6 --> iv_cand:5, cost=(5,1)
   use:7 --> iv_cand:5, cost=(5,1)
   use:8 --> iv_cand:5, cost=(5,1)
   use:9 --> iv_cand:5, cost=(5,1)

where

candidate 5 (important)
  var_before ivtmp.188 
  var_after ivtmp.188
  incremented before exit test
  type sizetype
  base 0
  step 8
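
As a cross-check on the two "Improved to:" summaries above, the per-use
costs can be summed by hand.  The sketch below is only arithmetic on the
numbers quoted in the dumps.  The helper name `summarize` is hypothetical,
and reading the leftover term as a part of the cost that the dump does not
itemize is an assumption, not something taken from the IVOPTs sources.

```python
# Per-use (cost, complexity) pairs copied from the two dumps above.
before = [(2, 0), (2, 0), (0, 0), (0, 0)] + [(4, 0)] * 6   # candidates 2, 4, 7
after = [(2, 0), (2, 0), (0, 0)] + [(5, 1)] * 7            # candidates 4, 5


def summarize(total, cand_cost, uses):
    """Return (sum of use costs, sum of complexities, remainder of total).

    `summarize` is a hypothetical helper for this note, not a GCC function.
    """
    use_cost = sum(cost for cost, _ in uses)
    complexity = sum(cplx for _, cplx in uses)
    return use_cost, complexity, total - cand_cost - use_cost


before_summary = summarize(128, 19, before)
after_summary = summarize(133, 13, after)
print(before_summary)
print(after_summary)
```

The sums come out as 28 and 39, matching the reported cand_use_cost values,
and the leftover is 81 in both cases.  So the new set saves 6 on candidates
(19 vs. 13) but pays 11 more on uses, which accounts for the total going
from 128 to 133.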

I think what is missing to relate uses 4 to 9, which all have a base of
the form
 <parameter> + (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
is a candidate that has the base object stripped and thus only tracks
 (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
We do at least have that as an IV:
ssa name D.2332_451
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
and, redundantly, as:
ssa name D.2354_680
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
ssa name D.2343_692
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
ssa name D.2365_752
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
ssa name D.2376_763
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
but there is no associated candidate.  If we add a candidate for it
(candidate 9), we end up with

Improved to:
  cost: 131 (complexity 0)
  cand_cost: 15
  cand_use_cost: 35 (complexity 0)
  candidates: 4, 9
   use:0 --> iv_cand:4, cost=(2,0)
   use:1 --> iv_cand:4, cost=(2,0)
   use:2 --> iv_cand:9, cost=(3,0)
   use:3 --> iv_cand:9, cost=(4,0)
   use:4 --> iv_cand:9, cost=(4,0)
   use:5 --> iv_cand:9, cost=(4,0)
   use:6 --> iv_cand:9, cost=(4,0)
   use:7 --> iv_cand:9, cost=(4,0)
   use:8 --> iv_cand:9, cost=(4,0)
   use:9 --> iv_cand:9, cost=(4,0)

but with that change we now unroll the innermost loop twice, so I'm not
sure it will pay off.  Even for the original patch that caused the
regression, the code generation differences are only in scheduling and
register allocation (so -fschedule-insns or -fsched-pressure may recover
it).
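
To make the stripped-base idea concrete, here is a small model of the
address arithmetic.  It is a sketch under stated assumptions, not IVOPTs
code: addresses are plain integers, the values chosen for `stride`,
`pretmp` and the parameter bases are made up, and the function names are
hypothetical.  It shows that one IV tracking only
((stride + pretmp) + 1) * 8 with step 8 is enough to form every use of the
form <parameter> + offset with a single addition of the loop-invariant
parameter.

```python
STEP = 8  # step shared by all the IVs in the dumps above


def offset_iv(stride, pretmp, iterations):
    """Yield the values of an IV with the base object stripped.

    Base is ((stride + pretmp) + 1) * 8, step is 8, as for D.2332_451.
    """
    value = ((stride + pretmp) + 1) * 8
    for _ in range(iterations):
        yield value
        value += STEP


def addresses_via_offset_iv(bases, stride, pretmp, iterations):
    """Form each use as <parameter base> + shared offset IV."""
    return [[base + off for base in bases]
            for off in offset_iv(stride, pretmp, iterations)]


def addresses_via_pointer_ivs(bases, stride, pretmp, iterations):
    """Reference: one separate pointer IV per parameter, like candidate 7."""
    ptrs = [base + ((stride + pretmp) + 1) * 8 for base in bases]
    rows = []
    for _ in range(iterations):
        rows.append(list(ptrs))
        ptrs = [p + STEP for p in ptrs]
    return rows


# Made-up values: two parameter arrays at distinct addresses.
bases = [0x1000, 0x8000]
shared = addresses_via_offset_iv(bases, stride=3, pretmp=5, iterations=4)
separate = addresses_via_pointer_ivs(bases, stride=3, pretmp=5, iterations=4)
print(shared == separate)
```

Both variants produce the same addresses in this model, but the shared
offset IV never derives the address of one object from a pointer into
another, which appears to be the property the wrong-code fix is after.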
