4.9 regression] unoptimal code for two simple loops

amker.cheng at gmail dot com Fri, 13 Dec 2013 03:02:06 -0800

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39838


--- Comment #16 from bin.cheng <amker.cheng at gmail dot com> ---
For optimization level O2, the dump before IVOPT is like:

  <bb 2>:
  _21 = p_6(D)->count;
  if (_21 > 0)
    goto <bb 3>;
  else
    goto <bb 11>;

  <bb 3>:

  <bb 4>:
  # i_26 = PHI <i_20(10), 0(3)>
  if (count_8(D) > 0)
    goto <bb 5>;
  else
    goto <bb 9>;

  <bb 5>:
  pretmp_23 = (sizetype) i_26;
  pretmp_32 = pretmp_23 + 1;
  pretmp_33 = pretmp_32 * 4;
  pretmp_34 = pretmp_23 * 4;

  <bb 6>:
  # j_27 = PHI <j_19(7), 0(5)>
  _9 = p_6(D)->data;
  _13 = _9 + pretmp_33;
  _14 = *_13;
  _16 = _9 + pretmp_34;
  _17 = *_16;
  func (_17, _14);
  j_19 = j_27 + 1;
  if (count_8(D) > j_19)
    goto <bb 7>;
  else
    goto <bb 8>;

  <bb 7>:
  goto <bb 6>;

  <bb 8>:

  <bb 9>:
  i_20 = i_26 + 1;
  _7 = p_6(D)->count;
  if (_7 > i_20)
    goto <bb 10>;
  else
    goto <bb 11>;

  <bb 10>:
  goto <bb 4>;

  <bb 11>:
  return;

There might be two issues that block IVOPT choosing the biv(i) for pretmp_33
and pretmp_34:
1) on some target (like ARM), "i << 2 + 4" can be done in one instruction, if
the cost is same as simple shift or plus, then overall cost of biv(i) is lower
than the two candidate iv sets.  GCC doesn't do such check in
get_computation_cost_at for now.
2) there is CSE opportunity between computation of pretmp_33 and pretmp_34, for
example they can be computed as below:
   pretmp_33 = i << 2
   pretmp_34 = pretmp_33 + 4
but GCC IVOPT is insensitive to such CSE opportunities between different iv
uses.  I guess this isn't easy because unless the two uses are very close in
code (like this one), such CSE may avail to nothing.

These kind tweaks on cost are tricky(and most probably has no overall benefit)
because the cost IVOPT computed from RTL is far from precise to do such fine
granularity tuning.

Another point, as Zdenek pointed out, IVOPT doesn't know that
pretmp_33/pretmp_34 are going to be used in memory accesses, which means some
of address computation can be embedded by appropriate addressing mode.  In
other words, computation of pretmp_33/pretmp_34 shouldn't be honored when
computing overall cost and choosing iv candidates set.  Since "_9 +
pretmp_33/pretmp_34" is not affine iv, the only way to handle this issue is to
lower both memory accesses before IVOPT, into some code like below:

  <bb 5>:
  pretmp_23 = (sizetype) i_26;
  pretmp_32 = pretmp_23 + 1;

  <bb 6>:
  # j_27 = PHI <j_19(7), 0(5)>
  _9 = p_6(D)->data;
  _14 = MEM[_9 + pretmp_32 << 2];
  _17 = MEM[_9 + pretmp_23 << 2];
  func (_17, _14);
  j_19 = j_27 + 1;
  if (count_8(D) > j_19)
    goto <bb 7>;
  else
    goto <bb 8>;

With this code, the iv uses are biv(i), pretmp_23(i_26) and pretmp_32(i_26+1),
and IVOPT won't even add the annoying candidate.

[Bug middle-end/39838] [4.7/4.8/4.9 regression] unoptimal code for two simple loops

Reply via email to