Re: More aggressive threading causing loop-interchange-9.c regression

Jeff Law via Gcc Thu, 09 Sep 2021 09:55:15 -0700



On 9/9/2021 2:14 AM, Aldy Hernandez wrote:

On 9/8/21 8:13 PM, Michael Matz wrote:
Hello,

[lame answer to self]

On Wed, 8 Sep 2021, Michael Matz wrote:
The forward threader guards against this by simply disallowing
threadings that involve different loops.  As I see
The thread in question (5->9->3) is all within the same outer loop,
though. BTW, the backward threader also disallows threading across
different loops (see path_crosses_loops variable).
...
Maybe it's possible to not disable threading over latches altogether in
the backward threader (like it's tried now), but I haven't looked at the
specific situation here in depth, so take my view only as opinion from a
large distance :-)
I've now looked at the concrete situation. So yeah, the whole path is in
the same loop, crosses the latch, _and there's code following the latch
on that path_.  (I.e. the latch isn't the last block in the path).  In
particular, after loop_optimizer_init() (before any threading) we have:

   <bb 3> [local count: 118111600]:
   # j_19 = PHI <j_13(9), 0(7)>
   sum_11 = c[j_19];
   if (n_10(D) > 0)
     goto <bb 8>; [89.00%]
   else
     goto <bb 5>; [11.00%]

      <bb 8> [local count: 105119324]:
...

   <bb 5> [local count: 118111600]:
   # sum_21 = PHI <sum_14(4), sum_11(3)>
   c[j_19] = sum_21;
   j_13 = j_19 + 1;
   if (n_10(D) > j_13)
     goto <bb 9>; [89.00%]
   else
     goto <bb 6>; [11.00%]

   <bb 9> [local count: 105119324]:
   goto <bb 3>; [100.00%]

With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
pre-header of inner loop, but more importantly something that's not at the
start of the outer loop.

Now, any thread that includes the backedge 9->3 _including_ its
destination (i.e. where the backedge isn't the last to-be-redirected edge)
necessarily duplicates all code from that destination onto the back edge.
Here it's the load from c[j] into sum_11.

The important part is the code is emitted onto the back edge,
conceptually; in reality it's simply included into the (new) latch block
(the duplicate of bb9, which is bb12 intermediately, then named bb7 after
cfg_cleanup).
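
At source level, the effect described above can be sketched roughly as
follows. This is a hand-written illustration of the CFG shapes, not the
loop-interchange-9.c source: the function names are made up, the inner loop
body is elided in the dump so the accumulation here is only a stand-in, and
n >= 1 is assumed because the dump does not show the guard before the loop.

```c
#include <assert.h>

/* Shape before threading: every memory reference is in the body,
   ahead of the exit test, and the latch (bb9) is empty. */
void loop_before(int n, int *c, const int *a)
{
    assert(n >= 1);                     /* assumption, see lead-in */
    int j = 0;
    do {
        int sum = c[j];                 /* bb3: load in the header */
        for (int i = 0; i < n; i++)     /* inner loop (stand-in body) */
            sum += a[i * n + j];
        c[j] = sum;                     /* bb5: store */
        j++;
    } while (n > j);                    /* bb5: exit test, then empty latch */
}

/* Shape after threading 5->9->3: the header's load is duplicated onto
   the back edge, so it now runs after the exit test. */
void loop_after(int n, int *c, const int *a)
{
    assert(n >= 1);
    int j = 0;
    int sum = c[j];                     /* original bb3, loop-entry path only */
    for (;;) {
        for (int i = 0; i < n; i++)
            sum += a[i * n + j];
        c[j] = sum;
        j++;
        if (!(n > j))
            break;                      /* exit test */
        sum = c[j];                     /* duplicated load, now in the latch */
    }
}
```

Both functions compute the same values; the only difference is where the
load of c[j] sits relative to the exit test.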

That's what we can't have for some of our structural loop optimizers:
there must be no code executed after the exit test (e.g. in the latch
block).  (This requirement makes reasoning about which code is or isn't
executed completely for an iteration trivial; simply everything in the
body is always executed; e.g. loop interchange uses this to check that
there are no memory references after the exit test, because those would
then be only conditional and hence make loop interchange very awkward).
Note that this situation can't be later rectified anymore: the duplicated
instructions (because they are memory refs) must remain after the exit
test.  Only by rerolling/unrotating the loop (i.e. noticing that the
memory refs on the loop-entry path and on the back edge are equivalent)
would that be possible, but that's something we aren't capable of.  Even
if we were that would simply just revert the whole work that the threader
did, so it's better to not even do that to start with.
I believe something like below would be appropriate, it disables threading
if the path contains a latch at the non-last position (due to being
backwards on the non-first position in the array).  I.e. it disables
rotating the loop if there's danger of polluting the back edge. It might
be improved if the blocks following (preceding!) the latch are themselves
empty because then no code is duplicated.  It might also be improved if
the latch is already non-empty. That code should probably only be active
before the loop optimizers, but currently the backward threader isn't
differentiating between before/after loop-optims.

I haven't tested this patch at all, except that it fixes the testcase :)
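
The check described above might look roughly like the following. This is
only a sketch of the idea, not the patch itself: struct block, its is_latch
flag and the function name are hypothetical stand-ins, not GCC's data
structures or API. The one thing taken from the description is that the path
array is stored backwards, so "latch at a non-last position in the thread"
means "latch at a non-first position in the array".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a basic block; not GCC's basic_block. */
struct block {
    int index;
    bool is_latch;      /* true if this block is the latch of its loop */
};

/* path[] is stored backwards: path[0] is the last block of the thread
   and path[len - 1] is the first.  Returns true if a latch shows up
   anywhere other than as the last block of the thread, i.e. at any
   array position other than 0.  Such a thread would be rejected.  */
bool latch_at_non_last_position(const struct block *const *path, size_t len)
{
    assert(path != NULL);
    for (size_t i = 1; i < len; i++)
        if (path[i]->is_latch)
            return true;
    return false;
}
```

With the thread 5->9->3 from the dump, the array is {bb3, bb9, bb5}, the
latch bb9 sits at index 1, and the function returns true. A thread that ends
at the latch (5->9, array {bb9, bb5}) returns false.
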
Thanks for looking at this.
I think you're onto something with this approach. Perhaps in addition to the loop header threading Richard mentions.
Your patch causes some regressions, but I think most are noise from FSM tests that must be adjusted. However, there are some other ones that are curious:
> FAIL: gcc.dg/tree-ssa/ldist-22.c scan-tree-dump ldist "generated memset zero"
> FAIL: gcc.dg/tree-ssa/pr66752-3.c scan-tree-dump-not dce2 "if .flag"
< XFAIL: gcc.dg/shrink-wrap-loop.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"
< XFAIL: gcc.dg/Warray-bounds-87.c pr101671 (test for bogus messages, line 36)
> FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times graphite "1 loops carried no dependency" 1
> FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times optimized "loopfn.1" 4
> FAIL: libgomp.graphite/force-parallel-8.c scan-tree-dump-times graphite "5 loops carried no dependency" 1
Interestingly your patch is fixing shrink-wrap-loop.c and Warray-bounds-87, both of which were introduced by the backward threader rewrite. At least the Warray-bounds was the threader peeling off an iteration that caused a bogus warning.
The ldist-22 regression is interesting though:

void foo ()
{
  int i;

  <bb 2> :
  goto <bb 6>; [INV]

  <bb 3> :
  a[i_1] = 0;
  if (i_1 > 100)
    goto <bb 4>; [INV]
  else
    goto <bb 5>; [INV]

  <bb 4> :
  b[i_1] = i_1;

  <bb 5> :
  i_8 = i_1 + 1;

  <bb 6> :
  # i_1 = PHI <0(2), i_8(5)>
  if (i_1 <= 1023)
    goto <bb 3>; [INV]
  else
    goto <bb 7>; [INV]

  <bb 7> :
  return;

}
Here we fail to look past 5->6 because BB5 is the latch and is not the last block in the path. So we fail to thread 3->5->6->3. Doing so would have split the function into two loops, one of which could use a memset:
void foo ()
{
  int i;

  <bb 2> :
  goto <bb 6>; [INV]

  <bb 3> :
  # i_12 = PHI <i_1(6), i_9(4)>
  a[i_12] = 0;
  if (i_12 > 100)
    goto <bb 5>; [INV]
  else
    goto <bb 4>; [INV]

  <bb 4> :
  i_9 = i_12 + 1;
  goto <bb 3>; [100.00%]

  <bb 5> :
  b[i_12] = i_12;
  i_8 = i_12 + 1;

  <bb 6> :
  # i_1 = PHI <0(2), i_8(5)>
  if (i_1 <= 1023)
    goto <bb 3>; [INV]
  else
    goto <bb 7>; [INV]

  <bb 7> :
  return;

}
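
Read back as C, the two dumps above look roughly like this. It is a
hand-written transcription of the CFG shapes, not the ldist-22.c source;
the function names, the pointer parameters and the check helper are made up
for the illustration.

```c
#include <assert.h>
#include <string.h>

enum { N = 1024 };

/* First dump: a single loop with a conditional store to b[]. */
void foo_single_loop(int *a, int *b)
{
    for (int i = 0; i <= 1023; i++) {   /* bb6: exit test */
        a[i] = 0;                       /* bb3 */
        if (i > 100)
            b[i] = i;                   /* bb4 */
    }                                   /* bb5: latch, i = i + 1 */
}

/* Second dump: threading 3->5->6->3 leaves an inner loop (bb3/bb4)
   that only stores to a[] and never tests against 1023. */
void foo_threaded(int *a, int *b)
{
    int i = 0;
    while (i <= 1023) {                 /* bb6 */
        for (;;) {                      /* bb3 */
            a[i] = 0;
            if (i > 100)
                break;                  /* goto bb5 */
            i++;                        /* bb4, back to bb3 */
        }
        b[i] = i;                       /* bb5 */
        i++;
    }
}

/* Runs both shapes on identically initialised arrays and checks
   that they leave memory in the same state. */
void check_shapes_agree(void)
{
    static int a1[N], b1[N], a2[N], b2[N];
    memset(a1, -1, sizeof a1);
    memset(a2, -1, sizeof a2);
    memset(b1, -1, sizeof b1);
    memset(b2, -1, sizeof b2);
    foo_single_loop(a1, b1);
    foo_threaded(a2, b2);
    assert(memcmp(a1, a2, sizeof a1) == 0);
    assert(memcmp(b1, b2, sizeof b1) == 0);
}
```

The second form is also a small instance of a simple loop being turned into
a loop nest: the bb3/bb4 loop now sits inside the bb6 loop.
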
I would have to agree that threading through latches is problematic. For that matter, the ldist-22 test shows that we're depending on the threader to do work that seems to belong in the loop optimizer world.
Would it be crazy to suggest that we disable threading through latches altogether, and do whatever we're missing in the loop world? It seems loop has all the tools, cost model, and framework to do so. Of course, I know 0 about loop, and would hate to add work to other's plates.

Threading through latches can be extremely problematic, as it can change the underlying loop structure. For example, it can take a simple loop and turn it into a loop nest.

I thought we'd already disabled threading through latches except for the case where doing so allows us to thread through an indirect jump (i.e., the original motivation for the FSM threader).

As I mentioned in our private email, threading through headers, through latches, across loops, etc. has a number of negative effects and we generally want to avoid that, at least in the instances before loop optimization and vectorization. The threader doesn't have any cost model for what is effectively loop peeling, loop rotation, loop header copying and the like. Zdenek argued those belong in the loop optimizers, not jump threading, and I broadly agree with that assessment.


Jeff
