Hello, On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote:
> The regression comes from the simple_reduc_1() function in > tree-ssa/loop-interchange-9.c, and it involves the following path: > > =========== BB 5 ============ > Imports: n_10(D) j_19 > Exports: n_10(D) j_13 j_19 > j_13 : j_19(I) > n_10(D) int [1, 257] > j_19 int [0, 256] > Relational : (j_13 > j_19) > <bb 5> [local count: 118111600]: > # sum_21 = PHI <sum_14(4), sum_11(3)> > c[j_19] = sum_21; > j_13 = j_19 + 1; > if (n_10(D) > j_13) > goto <bb 9>; [89.00%] > else > goto <bb 6>; [11.00%] So, this is the outer loops exit block ... > =========== BB 9 ============ > n_10(D) int [2, 257] > j_13 int [1, 256] > Relational : (n_10(D) > j_19) > Relational : (n_10(D) > j_13) > <bb 9> [local count: 105119324]: > goto <bb 3>; [100.00%] ... this the outer loops latch block ... > =========== BB 3 ============ > Imports: n_10(D) > Exports: n_10(D) > n_10(D) int [1, +INF] > <bb 3> [local count: 118111600]: > # j_19 = PHI <j_13(9), 0(7)> > sum_11 = c[j_19]; > if (n_10(D) > 0) > goto <bb 8>; [89.00%] > else > goto <bb 5>; [11.00%] ... and this the outer loops header, as well as inner loops pre-header. The attempted thread hence involves a back-edge (of the outer loop) and a loop-entry edge (bb3->bb8) of the inner loop. Doing that threading would destroy some properties that our loop optimizers rely on, e.g. that the loop header of a natural loop is only reached by two edges: entry edge and back edge, and that the latch blocks are empty and that there's only a single latch. (What exactly we require depends on several flags in loops_state_satisfies_p). > With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that > the loop cannot legally iterate outside the size of c[256]. So, j_13 > lies within [1, 257] and n_10 is [2, +INF] at the end of the path. > This allows us to thread BB3 to BB8. So, IIUC doing this threading would create a new entry to BB8: it would then be entered by its natural entry (bb3), by its natural back edge (whatever bb that is now) and the new entry from the threaded outer back edge (bb9 or bb5?). The inner loop wouldn't then be recognized as natural anymore and the whole nest not as perfect, and hence loop interchange can't easily happen anymore. Most other structural loop optimizers of us would have problems with that situation as well. > All the blocks lie within the same loop, and the path passes all the > tests in path_profitable_p(). > > Is there something about the shape of this path that should make it > invalid in the backward threader, or should the test be marked with > -fdisable-tree-threadN (or something else entirely)? This is a functionality test checking that the very necessary interchange in this test does happen with default plus -floop-interchange (with the intention of it being enabled by O3 or with profile info). So no additional flags can be added without changing the meaning of this test. > Note that the > backward threader is allowed to peel through loop headers. Something needs to give way in the path threaders before loop optimizations: either threading through back edges, through loop latches or through loop headers needs to be disabled. I think traditionally at least threading through latches should be disabled, because doing so usually destroys simple loop structure. I see that profitable_path_p() of the backward threader wants to be careful in some situations involving loops and latches; possibly it's not careful enough yet for the additional power brought by ranger. See also the additional tests tree-cfgcleanup.c:tree_forwarder_block_p is doing when loops are active. After the structural loop optimizers the threader can go wild and thread everything it finds. Ciao, Michael.