Re: More aggressive threading causing loop-interchange-9.c regression

Richard Biener via Gcc Wed, 08 Sep 2021 06:14:14 -0700

On Wed, Sep 8, 2021 at 12:44 PM Aldy Hernandez <al...@redhat.com> wrote:
>
> First of all.  Thanks so much for this incredibly useful explanation.
> It helps a lot.
>
> I'm still trying to wrap my head around this, so please bear with me.
>
> On 9/7/21 4:45 PM, Michael Matz wrote:
> > Hello,
> >
> > On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote:
> >
> >> The regression comes from the simple_reduc_1() function in
> >> tree-ssa/loop-interchange-9.c, and it involves the following path:
> >>
> >> =========== BB 5 ============
> >> Imports: n_10(D)  j_19
> >> Exports: n_10(D)  j_13  j_19
> >>           j_13 : j_19(I)
> >> n_10(D)      int [1, 257]
> >> j_19 int [0, 256]
> >> Relational : (j_13 > j_19)
> >>      <bb 5> [local count: 118111600]:
> >>      # sum_21 = PHI <sum_14(4), sum_11(3)>
> >>      c[j_19] = sum_21;
> >>      j_13 = j_19 + 1;
> >>      if (n_10(D) > j_13)
> >>        goto <bb 9>; [89.00%]
> >>      else
> >>        goto <bb 6>; [11.00%]
> >
> > So, this is the outer loops exit block ...
> >
> >> =========== BB 9 ============
> >> n_10(D)      int [2, 257]
> >> j_13 int [1, 256]
> >> Relational : (n_10(D) > j_19)
> >> Relational : (n_10(D) > j_13)
> >>      <bb 9> [local count: 105119324]:
> >>      goto <bb 3>; [100.00%]
> >
> > ... this the outer loops latch block ...
> >
> >> =========== BB 3 ============
> >> Imports: n_10(D)
> >> Exports: n_10(D)
> >> n_10(D)      int [1, +INF]
> >>      <bb 3> [local count: 118111600]:
> >>      # j_19 = PHI <j_13(9), 0(7)>
> >>      sum_11 = c[j_19];
> >>      if (n_10(D) > 0)
> >>        goto <bb 8>; [89.00%]
> >>      else
> >>        goto <bb 5>; [11.00%]
> >
> > ... and this the outer loops header, as well as inner loops pre-header.
>
> Actually, the inner loop's pre-header seems to be BB8, which is empty,
> and leads into the BB4 which is the header of the inner loop.
>
> > The attempted thread hence involves a back-edge (of the outer loop) and a
> > loop-entry edge (bb3->bb8) of the inner loop.  Doing that threading would
> > destroy some properties that our loop optimizers rely on, e.g. that the
> > loop header of a natural loop is only reached by two edges: entry edge and
> > back edge, and that the latch blocks are empty and that there's only a
> > single latch.  (What exactly we require depends on several flags in
> > loops_state_satisfies_p).
>
> But threading 3->8 doesn't involve a loop-entry edge.  It involves a new
> edge into the *pre-header* of the inner loop.
>
> I am attaching the IL right after threading and before we clean up the
> CFG (il-after-threading.txt).
>
> After threading we have have rewritten the 5->9 edge into 5->11:
>
>    <bb 5> [local count: 118111600]:
>    # sum_21 = PHI <sum_14(4), sum_11(3)>
>    c[j_19] = sum_21;
>    j_13 = j_19 + 1;
>    if (n_10(D) > j_13)
>      goto <bb 11>; [89.00%]
>    else
>      goto <bb 6>; [11.00%]
>
>    <bb 11> [local count: 105119324]:
>
>    <bb 12> [local count: 105119324]:
>    # j_7 = PHI <j_13(11)>
>    sum_6 = c[j_19];
>    goto <bb 8>; [100.00%]
>
>    <bb 9> [local count: 0]:     ;; This becomes unreachable.
>    goto <bb 3>; [100.00%]
>
> Notice BB9 becomes unreachable, so we're basically eliding the latch in BB9.
>
> Question: could BB12 not be considered the loop latch now and BB8 be the
> outer loop's entry?  This would also mean, that BB3 is now outside of
> the outer loop, which means the threader peeled off the first iteration
> of the loop.  Or is it a requirement that the latch be empty?
>
> >
> >> With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that
> >> the loop cannot legally iterate outside the size of c[256].  So, j_13
> >> lies within [1, 257] and n_10 is [2, +INF] at the end of the path.
> >> This allows us to thread BB3 to BB8.
> >
> > So, IIUC doing this threading would create a new entry to BB8: it would
> > then be entered by its natural entry (bb3), by its natural back edge
> > (whatever bb that is now) and the new entry from the threaded outer back
> > edge (bb9 or bb5?).
>
> As I mentioned, BB8 looks like the pre-header to the inner loop.  But
> yes, it now has multiple entries: the edge from BB3, and the back edge
> from BB12 (which seems like it could be a new latch, since it's the only
> back edge).


It would be helpful to have the patch causing the issue to look at the IL.
But as Micha said, there needs to be a perfect loop nest for interchange
to work.

Richard.

> >
> > The inner loop wouldn't then be recognized as natural anymore and the
> > whole nest not as perfect, and hence loop interchange can't easily happen
> > anymore.  Most other structural loop optimizers of us would have problems
> > with that situation as well.
>
> If I clean up the CFG and call loop_optimizer_init() again at this
> point, what is destroyed is the outer loop, not the inner loop.  If you
> look at the IL at this point
> (il-after-cleanup-and-rerunning-loop-init.txt), we have a latch in BB7
> going back up to BB4, though the conditional is now in BB6 (eeech):
>
>    <bb 4>
> ...
> ...
> ...
>
>
>    <bb 6> [local count: 118111600]:
>    # sum_21 = PHI <sum_14(5), sum_11(3)>
>    # j_18 = PHI <j_17(5), j_19(3)>
>    c[j_18] = sum_21;
>    j_13 = j_18 + 1;
>    if (n_10(D) > j_13)
>      goto <bb 7>; [89.00%]
>    else
>      goto <bb 8>; [11.00%]
>
>    <bb 7> [local count: 105119324]:
>    # j_7 = PHI <j_13(6)>
>    sum_6 = c[j_7];
>    goto <bb 4>; [100.00%]
>
> Perhaps, this looks sufficiently different that the loop optimizer can't
> recognize it as a loop?
>
> >
> >> All the blocks lie within the same loop, and the path passes all the
> >> tests in path_profitable_p().
> >>
> >> Is there something about the shape of this path that should make it
> >> invalid in the backward threader, or should the test be marked with
> >> -fdisable-tree-threadN (or something else entirely)?
> >
> > This is a functionality test checking that the very necessary interchange
> > in this test does happen with default plus -floop-interchange (with the
> > intention of it being enabled by O3 or with profile info).  So no
> > additional flags can be added without changing the meaning of this test.
>
> Agreed.  We're shuffling things around.   What I'm missing here is why
> we're unable to recognize this new form as a loop.  That being said, I
> am not above figuring out what the magic incantation is to keep the
> threader from doing this transformation.  However, I just want to make
> sure we're not papering over something else.
>
> >
> >> Note that the
> >> backward threader is allowed to peel through loop headers.
> >
> > Something needs to give way in the path threaders before loop
> > optimizations: either threading through back edges, through loop latches
> > or through loop headers needs to be disabled.  I think traditionally at
> > least threading through latches should be disabled, because doing so
> > usually destroys simple loop structure.  I see that profitable_path_p() of
> > the backward threader wants to be careful in some situations involving
> > loops and latches; possibly it's not careful enough yet for the additional
> > power brought by ranger.
> >
> > See also the additional tests tree-cfgcleanup.c:tree_forwarder_block_p is
> > doing when loops are active.
> >
> > After the structural loop optimizers the threader can go wild and thread
> > everything it finds.
>
> Thanks again.  This is very helpful.
>
> Aldy

Re: More aggressive threading causing loop-interchange-9.c regression

Reply via email to