Re: More aggressive threading causing loop-interchange-9.c regression

Aldy Hernandez via Gcc Wed, 08 Sep 2021 03:45:10 -0700

First of all. Thanks so much for this incredibly useful explanation.It helps a lot.


I'm still trying to wrap my head around this, so please bear with me.


On 9/7/21 4:45 PM, Michael Matz wrote:

Hello,

On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote:

The regression comes from the simple_reduc_1() function in
tree-ssa/loop-interchange-9.c, and it involves the following path:

=========== BB 5 ============
Imports: n_10(D)  j_19
Exports: n_10(D)  j_13  j_19
          j_13 : j_19(I)
n_10(D) int [1, 257]
j_19    int [0, 256]
Relational : (j_13 > j_19)
     <bb 5> [local count: 118111600]:
     # sum_21 = PHI <sum_14(4), sum_11(3)>
     c[j_19] = sum_21;
     j_13 = j_19 + 1;
     if (n_10(D) > j_13)
       goto <bb 9>; [89.00%]
     else
       goto <bb 6>; [11.00%]


So, this is the outer loops exit block ...

=========== BB 9 ============
n_10(D) int [2, 257]
j_13    int [1, 256]
Relational : (n_10(D) > j_19)
Relational : (n_10(D) > j_13)
     <bb 9> [local count: 105119324]:
     goto <bb 3>; [100.00%]


... this the outer loops latch block ...

=========== BB 3 ============
Imports: n_10(D)
Exports: n_10(D)
n_10(D) int [1, +INF]
     <bb 3> [local count: 118111600]:
     # j_19 = PHI <j_13(9), 0(7)>
     sum_11 = c[j_19];
     if (n_10(D) > 0)
       goto <bb 8>; [89.00%]
     else
       goto <bb 5>; [11.00%]


... and this the outer loops header, as well as inner loops pre-header.

Actually, the inner loop's pre-header seems to be BB8, which is empty,and leads into the BB4 which is the header of the inner loop.

The attempted thread hence involves a back-edge (of the outer loop) and a
loop-entry edge (bb3->bb8) of the inner loop.  Doing that threading would
destroy some properties that our loop optimizers rely on, e.g. that the
loop header of a natural loop is only reached by two edges: entry edge and
back edge, and that the latch blocks are empty and that there's only a
single latch.  (What exactly we require depends on several flags in
loops_state_satisfies_p).

But threading 3->8 doesn't involve a loop-entry edge. It involves a newedge into the *pre-header* of the inner loop.

I am attaching the IL right after threading and before we clean up theCFG (il-after-threading.txt).


After threading we have have rewritten the 5->9 edge into 5->11:

  <bb 5> [local count: 118111600]:
  # sum_21 = PHI <sum_14(4), sum_11(3)>
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 11>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 11> [local count: 105119324]:

  <bb 12> [local count: 105119324]:
  # j_7 = PHI <j_13(11)>
  sum_6 = c[j_19];
  goto <bb 8>; [100.00%]

  <bb 9> [local count: 0]:        ;; This becomes unreachable.
  goto <bb 3>; [100.00%]

Notice BB9 becomes unreachable, so we're basically eliding the latch in BB9.

Question: could BB12 not be considered the loop latch now and BB8 be theouter loop's entry? This would also mean, that BB3 is now outside ofthe outer loop, which means the threader peeled off the first iterationof the loop. Or is it a requirement that the latch be empty?

With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that
the loop cannot legally iterate outside the size of c[256].  So, j_13
lies within [1, 257] and n_10 is [2, +INF] at the end of the path.
This allows us to thread BB3 to BB8.


So, IIUC doing this threading would create a new entry to BB8: it would
then be entered by its natural entry (bb3), by its natural back edge
(whatever bb that is now) and the new entry from the threaded outer back
edge (bb9 or bb5?).

As I mentioned, BB8 looks like the pre-header to the inner loop. Butyes, it now has multiple entries: the edge from BB3, and the back edgefrom BB12 (which seems like it could be a new latch, since it's the onlyback edge).


The inner loop wouldn't then be recognized as natural anymore and the
whole nest not as perfect, and hence loop interchange can't easily happen
anymore.  Most other structural loop optimizers of us would have problems
with that situation as well.

If I clean up the CFG and call loop_optimizer_init() again at thispoint, what is destroyed is the outer loop, not the inner loop. If youlook at the IL at this point(il-after-cleanup-and-rerunning-loop-init.txt), we have a latch in BB7going back up to BB4, though the conditional is now in BB6 (eeech):


  <bb 4>
...
...
...


  <bb 6> [local count: 118111600]:
  # sum_21 = PHI <sum_14(5), sum_11(3)>
  # j_18 = PHI <j_17(5), j_19(3)>
  c[j_18] = sum_21;
  j_13 = j_18 + 1;
  if (n_10(D) > j_13)
    goto <bb 7>; [89.00%]
  else
    goto <bb 8>; [11.00%]

  <bb 7> [local count: 105119324]:
  # j_7 = PHI <j_13(6)>
  sum_6 = c[j_7];
  goto <bb 4>; [100.00%]

Perhaps, this looks sufficiently different that the loop optimizer can'trecognize it as a loop?

All the blocks lie within the same loop, and the path passes all the
tests in path_profitable_p().

Is there something about the shape of this path that should make it
invalid in the backward threader, or should the test be marked with
-fdisable-tree-threadN (or something else entirely)?


This is a functionality test checking that the very necessary interchange
in this test does happen with default plus -floop-interchange (with the
intention of it being enabled by O3 or with profile info).  So no
additional flags can be added without changing the meaning of this test.

Agreed. We're shuffling things around. What I'm missing here is whywe're unable to recognize this new form as a loop. That being said, Iam not above figuring out what the magic incantation is to keep thethreader from doing this transformation. However, I just want to makesure we're not papering over something else.

Note that the
backward threader is allowed to peel through loop headers.


Something needs to give way in the path threaders before loop
optimizations: either threading through back edges, through loop latches
or through loop headers needs to be disabled.  I think traditionally at
least threading through latches should be disabled, because doing so
usually destroys simple loop structure.  I see that profitable_path_p() of
the backward threader wants to be careful in some situations involving
loops and latches; possibly it's not careful enough yet for the additional
power brought by ranger.

See also the additional tests tree-cfgcleanup.c:tree_forwarder_block_p is
doing when loops are active.

After the structural loop optimizers the threader can go wild and thread
everything it finds.


Thanks again.  This is very helpful.

Aldy

__attribute__((noinline))
void simple_reduc_1 (int n)
{
  int i;
  int sum;
  int j;
  int _1;
  int _2;
  int _3;

  <bb 2> [local count: 14598063]:
  if (n_10(D) > 0)
    goto <bb 7>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 7> [local count: 12992276]:

  <bb 3> [local count: 12992276]:
  # j_19 = PHI <j_13(9), 0(7)>
  sum_11 = c[j_19];
  if (n_10(D) > 0)
    goto <bb 8>; [66.92%]
  else
    goto <bb 5>; [33.08%]

  <bb 8> [local count: 105119324]:

  <bb 4> [local count: 955630225]:
  # sum_20 = PHI <sum_14(10), sum_11(8)>
  # i_22 = PHI <i_15(10), 0(8)>
  _1 = a[i_22][j_19];
  _2 = b[i_22][j_19];
  _3 = _1 * _2;
  sum_14 = _3 + sum_20;
  i_15 = i_22 + 1;
  if (n_10(D) > i_15)
    goto <bb 10>; [89.00%]
  else
    goto <bb 5>; [11.00%]

  <bb 10> [local count: 850510901]:
  goto <bb 4>; [100.00%]

  <bb 5> [local count: 118111600]:
  # sum_21 = PHI <sum_14(4), sum_11(3)>
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 11>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 11> [local count: 105119324]:

  <bb 12> [local count: 105119324]:
  # j_7 = PHI <j_13(11)>
  sum_6 = c[j_19];
  goto <bb 8>; [100.00%]

  <bb 9> [local count: 0]:
  goto <bb 3>; [100.00%]

  <bb 6> [local count: 14598063]:
  return;

}

void simple_reduc_1 (int n)
{
  int i;
  int sum;
  int j;
  int _1;
  int _2;
  int _3;

  <bb 2> [local count: 14598063]:
  if (n_10(D) > 0)
    goto <bb 3>; [89.00%]
  else
    goto <bb 8>; [11.00%]

  <bb 3> [local count: 12992276]:
  # j_19 = PHI <0(2)>
  sum_11 = c[j_19];
  if (n_10(D) > 0)
    goto <bb 4>; [66.92%]
  else
    goto <bb 6>; [33.08%]

  <bb 4> [local count: 105119324]:
  # sum_5 = PHI <sum_11(3), sum_6(7)>
  # j_17 = PHI <j_19(3), j_7(7)>

  <bb 5> [local count: 955630225]:
  # sum_20 = PHI <sum_14(9), sum_5(4)>
  # i_22 = PHI <i_15(9), 0(4)>
  _1 = a[i_22][j_17];
  _2 = b[i_22][j_17];
  _3 = _1 * _2;
  sum_14 = _3 + sum_20;
  i_15 = i_22 + 1;
  if (n_10(D) > i_15)
    goto <bb 9>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 9> [local count: 850510901]:
  goto <bb 5>; [100.00%]

  <bb 6> [local count: 118111600]:
  # sum_21 = PHI <sum_14(5), sum_11(3)>
  # j_18 = PHI <j_17(5), j_19(3)>
  c[j_18] = sum_21;
  j_13 = j_18 + 1;
  if (n_10(D) > j_13)
    goto <bb 7>; [89.00%]
  else
    goto <bb 8>; [11.00%]

  <bb 7> [local count: 105119324]:
  # j_7 = PHI <j_13(6)>
  sum_6 = c[j_7];
  goto <bb 4>; [100.00%]

  <bb 8> [local count: 14598063]:
  return;

}

Re: More aggressive threading causing loop-interchange-9.c regression

Reply via email to