https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120003

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
                 CC|                              |rguenth at gcc dot gnu.org
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #6)
> (In reply to Andrew Macleod from comment #4)
> > This seems to be the issue?
> >
> >   <bb 4> [local count: 350791453]:
> >   _1 = g (i_3);
> >   if (_1 != 0)
> >     goto <bb 5>; [50.00%]
> >   else
> >     goto <bb 6>; [50.00%]
> >
> >   <bb 5> [local count: 175395727]:
> >
> >   <bb 6> [local count: 1063004408]:
> >   # iftmp.0_4 = PHI <1(3), 0(4), 1(5)>
> >
> > That 3 way PHI isn't used in any threads, so we don't get a threaded path
> > to the eventual return of 1.
>
> The irreducible check is at least badly named - as written it does not
> make the containing loop irreducible, instead it partly unrolls things.
>
> But with that fixed we still reject the path in
> jt_path_registry::cancel_invalid_paths by
>
> 2840          cancel_thread (&path, "Path crosses loop header but does not exit it");
>
> which is true again.  We can allow another subset of threads, but this
> then enables the
>
> path: 9->6->7->3->6
>
> path which just duplicates one iteration which does not help.
>
> We need to create a subloop or sibling loop w/o the call.  I don't see
> offhand why this doesn't work - but then isolating a path will never
> create a new loop(?)
>
> I've played with the following.
>
> diff --git a/gcc/tree-ssa-threadbackward.cc b/gcc/tree-ssa-threadbackward.cc
> index 23bfc14c8f0..2603d27f1f3 100644
> --- a/gcc/tree-ssa-threadbackward.cc
> +++ b/gcc/tree-ssa-threadbackward.cc
> @@ -789,6 +789,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
>    *creates_irreducible_loop = false;
>    if (m_threaded_through_latch
>        && loop == taken_edge->dest->loop_father
> +      && taken_edge->dest != m_path[m_path.length () - 2]
>        && (determine_bb_domination_status (loop, taken_edge->dest)
>            == DOMST_NONDOMINATING))
>      *creates_irreducible_loop = true;
> diff --git a/gcc/tree-ssa-threadupdate.cc b/gcc/tree-ssa-threadupdate.cc
> index 4e5c7566857..d91c0c7bf20 100644
> --- a/gcc/tree-ssa-threadupdate.cc
> +++ b/gcc/tree-ssa-threadupdate.cc
> @@ -2811,6 +2811,10 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
>            && flow_loop_nested_p (exit->dest->loop_father, exit->src->loop_father))
>          return false;
>
> +      // If we thread a whole loop round-trip, we are just creating a subloop
> +      if (entry->dest == exit->dest)
> +        return false;
> +
>        if (cfun->curr_properties & PROP_loop_opts_done)
>          return false;

Note this patch ends up restoring the optimization, but the threading
itself is not what does it.  Instead thread2 forms the inner loop and
threadfull2 then makes it a sibling loop which cddce3 can elide.  So it is
quite a complicated dance: threadfull, thread, threadfull.  The question is
why we need to iterate here and whether we can do better.  After loop opts
we only have one threadfull instance.

In particular, disabling thread2 makes threadfull2 form the inner loop and
we lose.

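To make the "sibling loop w/o the call" idea concrete, here is a hand-written source-level sketch. It assumes the testcase is roughly a loop of the form retval = retval || g (i), which is only inferred from the PHI in the quoted dump. The body of g, the counter, and the function names are hypothetical stand-ins, not taken from the PR.

```cpp
// Hypothetical stand-in for the call in the dump; the real g is not shown
// in this comment.  The counter only exists to compare call counts.
static int calls = 0;
bool g(int i) { ++calls; return i == 3; }

constexpr int N = 1000000;

// Assumed shape of the original loop (inferred, not the actual testcase).
int loop_original() {
  int retval = 0;
  for (int i = 0; i < N; ++i)
    retval = retval || g(i);   // short-circuits once retval is nonzero
  return retval;
}

// Hand-threaded equivalent: the first loop contains the call; once g
// returns nonzero, control moves to a sibling loop with no call and no
// side effects, which a later dead-code pass could delete entirely.
int loop_sibling() {
  int i = 0;
  for (; i < N; ++i)
    if (g(i))
      goto sticky;
  return 0;
sticky:
  for (++i; i < N; ++i)
    ;                          // empty remainder of the iteration space
  return 1;
}
```

Both functions call g the same number of times and return the same value; the second form just makes it visible that the remaining iterations do nothing.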
Disabling threadfull1 (with the above patch) makes neither pass do any
threading (not even the one I got threadfull1 to do), possibly because
the loop was rotated by header copying to the following and there we
don't seem to try the cross-iteration invariance of (retval_15 != 0) == true,
or rather it's possibly the lack of a forwarder for the 3->5 edge and
that we're basic-block based, we only consider '3' once.

  <bb 3> [local count: 1063004408]:
  # retval_15 = PHI <prephitmp_16(7), 0(2)>
  # i_17 = PHI <i_11(7), 0(2)>
  if (retval_15 != 0)
    goto <bb 5>; [67.00%]
  else
    goto <bb 4>; [33.00%]

  <bb 4> [local count: 350791453]:
  _1 = g (i_17);

  <bb 5> [local count: 1063004408]:
  # prephitmp_16 = PHI <1(3), _1(4)>
  i_11 = i_17 + 1;
  if (i_11 != 1000000)
    goto <bb 7>; [98.99%]
  else
    goto <bb 6>; [1.01%]

  <bb 7> [local count: 1052266995]:
  goto <bb 3>; [100.00%]

  <bb 6> [local count: 10737416]:
  return prephitmp_16;

In fact fixing that fixes the regression with the help of threadfull2 + vrp2.
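
For readers who prefer source over GIMPLE, the rotated loop in the dump can be transcribed by hand roughly as follows. This is only a sketch: the body of g, its bool return type, and the call counter are assumptions made so the example is self-contained. The comments map each statement back to the basic blocks in the dump.

```cpp
// Hypothetical stand-in for g; the real function is not shown in this
// comment.  The bool return type is an assumption.
static int calls = 0;
bool g(int i) { ++calls; return i == 3; }

// Hand transcription of the rotated loop from the dump.
int rotated_loop() {
  int retval = 0;              // retval_15, 0 on entry from bb 2
  int i = 0;                   // i_17, 0 on entry from bb 2
  for (;;) {
    int next;                  // prephitmp_16
    if (retval != 0)           // bb 3
      next = 1;                // edge 3->5: PHI argument 1(3)
    else
      next = g(i);             // bb 4: PHI argument _1(4)
    i = i + 1;                 // bb 5: i_11 = i_17 + 1
    if (i == 1000000)
      return next;             // bb 6
    retval = next;             // bb 7: back edge to bb 3
  }
}
```

The cross-iteration invariance is visible here: once next is set to 1 on the 3->5 edge, retval is nonzero on the following iteration, so the same edge is taken again and g is never called for the rest of the loop.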