On Wed, Sep 8, 2021 at 12:44 PM Aldy Hernandez <al...@redhat.com> wrote: > > First of all. Thanks so much for this incredibly useful explanation. > It helps a lot. > > I'm still trying to wrap my head around this, so please bear with me. > > On 9/7/21 4:45 PM, Michael Matz wrote: > > Hello, > > > > On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote: > > > >> The regression comes from the simple_reduc_1() function in > >> tree-ssa/loop-interchange-9.c, and it involves the following path: > >> > >> =========== BB 5 ============ > >> Imports: n_10(D) j_19 > >> Exports: n_10(D) j_13 j_19 > >> j_13 : j_19(I) > >> n_10(D) int [1, 257] > >> j_19 int [0, 256] > >> Relational : (j_13 > j_19) > >> <bb 5> [local count: 118111600]: > >> # sum_21 = PHI <sum_14(4), sum_11(3)> > >> c[j_19] = sum_21; > >> j_13 = j_19 + 1; > >> if (n_10(D) > j_13) > >> goto <bb 9>; [89.00%] > >> else > >> goto <bb 6>; [11.00%] > > > > So, this is the outer loops exit block ... > > > >> =========== BB 9 ============ > >> n_10(D) int [2, 257] > >> j_13 int [1, 256] > >> Relational : (n_10(D) > j_19) > >> Relational : (n_10(D) > j_13) > >> <bb 9> [local count: 105119324]: > >> goto <bb 3>; [100.00%] > > > > ... this the outer loops latch block ... > > > >> =========== BB 3 ============ > >> Imports: n_10(D) > >> Exports: n_10(D) > >> n_10(D) int [1, +INF] > >> <bb 3> [local count: 118111600]: > >> # j_19 = PHI <j_13(9), 0(7)> > >> sum_11 = c[j_19]; > >> if (n_10(D) > 0) > >> goto <bb 8>; [89.00%] > >> else > >> goto <bb 5>; [11.00%] > > > > ... and this the outer loops header, as well as inner loops pre-header. > > Actually, the inner loop's pre-header seems to be BB8, which is empty, > and leads into the BB4 which is the header of the inner loop. > > > The attempted thread hence involves a back-edge (of the outer loop) and a > > loop-entry edge (bb3->bb8) of the inner loop. Doing that threading would > > destroy some properties that our loop optimizers rely on, e.g. that the > > loop header of a natural loop is only reached by two edges: entry edge and > > back edge, and that the latch blocks are empty and that there's only a > > single latch. (What exactly we require depends on several flags in > > loops_state_satisfies_p). > > But threading 3->8 doesn't involve a loop-entry edge. It involves a new > edge into the *pre-header* of the inner loop. > > I am attaching the IL right after threading and before we clean up the > CFG (il-after-threading.txt). > > After threading we have have rewritten the 5->9 edge into 5->11: > > <bb 5> [local count: 118111600]: > # sum_21 = PHI <sum_14(4), sum_11(3)> > c[j_19] = sum_21; > j_13 = j_19 + 1; > if (n_10(D) > j_13) > goto <bb 11>; [89.00%] > else > goto <bb 6>; [11.00%] > > <bb 11> [local count: 105119324]: > > <bb 12> [local count: 105119324]: > # j_7 = PHI <j_13(11)> > sum_6 = c[j_19]; > goto <bb 8>; [100.00%] > > <bb 9> [local count: 0]: ;; This becomes unreachable. > goto <bb 3>; [100.00%] > > Notice BB9 becomes unreachable, so we're basically eliding the latch in BB9. > > Question: could BB12 not be considered the loop latch now and BB8 be the > outer loop's entry? This would also mean, that BB3 is now outside of > the outer loop, which means the threader peeled off the first iteration > of the loop. Or is it a requirement that the latch be empty? > > > > >> With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that > >> the loop cannot legally iterate outside the size of c[256]. So, j_13 > >> lies within [1, 257] and n_10 is [2, +INF] at the end of the path. > >> This allows us to thread BB3 to BB8. > > > > So, IIUC doing this threading would create a new entry to BB8: it would > > then be entered by its natural entry (bb3), by its natural back edge > > (whatever bb that is now) and the new entry from the threaded outer back > > edge (bb9 or bb5?). > > As I mentioned, BB8 looks like the pre-header to the inner loop. But > yes, it now has multiple entries: the edge from BB3, and the back edge > from BB12 (which seems like it could be a new latch, since it's the only > back edge).
It would be helpful to have the patch causing the issue to look at the IL. But as Micha said, there needs to be a perfect loop nest for interchange to work. Richard. > > > > The inner loop wouldn't then be recognized as natural anymore and the > > whole nest not as perfect, and hence loop interchange can't easily happen > > anymore. Most other structural loop optimizers of us would have problems > > with that situation as well. > > If I clean up the CFG and call loop_optimizer_init() again at this > point, what is destroyed is the outer loop, not the inner loop. If you > look at the IL at this point > (il-after-cleanup-and-rerunning-loop-init.txt), we have a latch in BB7 > going back up to BB4, though the conditional is now in BB6 (eeech): > > <bb 4> > ... > ... > ... > > > <bb 6> [local count: 118111600]: > # sum_21 = PHI <sum_14(5), sum_11(3)> > # j_18 = PHI <j_17(5), j_19(3)> > c[j_18] = sum_21; > j_13 = j_18 + 1; > if (n_10(D) > j_13) > goto <bb 7>; [89.00%] > else > goto <bb 8>; [11.00%] > > <bb 7> [local count: 105119324]: > # j_7 = PHI <j_13(6)> > sum_6 = c[j_7]; > goto <bb 4>; [100.00%] > > Perhaps, this looks sufficiently different that the loop optimizer can't > recognize it as a loop? > > > > >> All the blocks lie within the same loop, and the path passes all the > >> tests in path_profitable_p(). > >> > >> Is there something about the shape of this path that should make it > >> invalid in the backward threader, or should the test be marked with > >> -fdisable-tree-threadN (or something else entirely)? > > > > This is a functionality test checking that the very necessary interchange > > in this test does happen with default plus -floop-interchange (with the > > intention of it being enabled by O3 or with profile info). So no > > additional flags can be added without changing the meaning of this test. > > Agreed. We're shuffling things around. What I'm missing here is why > we're unable to recognize this new form as a loop. That being said, I > am not above figuring out what the magic incantation is to keep the > threader from doing this transformation. However, I just want to make > sure we're not papering over something else. > > > > >> Note that the > >> backward threader is allowed to peel through loop headers. > > > > Something needs to give way in the path threaders before loop > > optimizations: either threading through back edges, through loop latches > > or through loop headers needs to be disabled. I think traditionally at > > least threading through latches should be disabled, because doing so > > usually destroys simple loop structure. I see that profitable_path_p() of > > the backward threader wants to be careful in some situations involving > > loops and latches; possibly it's not careful enough yet for the additional > > power brought by ranger. > > > > See also the additional tests tree-cfgcleanup.c:tree_forwarder_block_p is > > doing when loops are active. > > > > After the structural loop optimizers the threader can go wild and thread > > everything it finds. > > Thanks again. This is very helpful. > > Aldy