On Mon, Jun 14, 2021 at 12:02 AM Jeff Law via Gcc <gcc@gcc.gnu.org> wrote:
>
>
>
> On 6/9/2021 5:48 AM, Aldy Hernandez wrote:
> > Hi Jeff.  Hi folks.
> >
> > What started as a foray into severing the old (forward) threader's
> > dependency on evrp, turned into a rewrite of the backwards threader
> > code.  I'd like to discuss the possibility of replacing the current
> > backwards threader with a new one that gets far more threads and can
> > potentially subsume all threaders in the future.
> >
> > I won't include code here, as it will just detract from the high level
> > discussion.  But if it helps, I could post what I have, which just
> > needs some cleanups and porting to the latest trunk changes Andrew has
> > made.
> >
> > Currently the backwards threader works by traversing DEF chains
> > through PHIs leading to possible paths that start in a constant. When
> > such a path is found, it is checked to see if it is profitable, and if
> > so, the constant path is threaded.  The current implementation is
> > rather limited since backwards paths must end in a constant.  For
> > example, the backwards threader can't get any of the tests in
> > gcc.dg/tree-ssa/ssa-thread-14.c:
> >
> >   if (a && b)
> >     foo ();
> >   if (!b && c)
> >     bar ();
> >
> > etc.
> Right.  And these kinds of cases are particularly interesting to capture
> -- not only do you remove the runtime test/compare, all the setup code
> usually dies as well.  I can't remember who, but someone added some bits
> to detect these cases in DOM a while back and while the number of
> additional jumps threaded wasn't great, the overall impact was much
> better than we initially realized.  Instead of allowing removal of a
> single compare/branch, it typically allowed removal of a chain of
> logicals that fed the conditional.
>
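To make the ssa-thread-14.c shape above concrete, here is a minimal hand-written sketch of what threading those two conditionals amounts to. It is not compiler output and not taken from the testsuite: foo () and bar () are hypothetical stand-ins that only log their calls, and threaded () is written by hand to mirror the paths a threader could in principle isolate.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for foo () and bar (): each appends one letter
// to a trace so the two versions can be compared.
std::string trace;
void foo () { trace += 'f'; }
void bar () { trace += 'b'; }

// The shape quoted from the test: two conditionals that share 'b'.
void original (bool a, bool b, bool c)
{
  if (a && b)
    foo ();
  if (!b && c)
    bar ();
}

// Hand-threaded version.  Once 'b' has been tested on a path, the
// outcome of the '!b' test in the second conditional is already known.
void threaded (bool a, bool b, bool c)
{
  if (a)
    {
      if (b)
        {
          foo ();
          return;       // b is true, so "!b && c" is known false.
        }
      if (c)            // b is false, so only c is left to test.
        bar ();
      return;
    }
  if (!b && c)          // Nothing is known about b on this path.
    bar ();
}

// Run one version on one input and return the calls it made.
std::string run (void (*fn) (bool, bool, bool), bool a, bool b, bool c)
{
  trace.clear ();
  fn (a, b, c);
  return trace;
}

// Check that both versions make the same calls for all eight inputs.
bool all_inputs_agree ()
{
  for (int bits = 0; bits < 8; ++bits)
    {
      bool a = bits & 1, b = bits & 2, c = bits & 4;
      if (run (original, a, b, c) != run (threaded, a, b, c))
        return false;
    }
  return true;
}
```

On the paths where b has already been tested, the second comparison on b disappears entirely, which is the effect described above.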
> >
> > After my refactoring patches to the threading code, it is now possible
> > to drop in an alternate implementation that shares the profitability
> > code (is this path profitable?), the jump registry, and the actual
> > jump threading code.  I have leveraged this to write a ranger-based
> > threader that gets every single thread the current code gets, plus
> > 90-130% more.
> Sweet.
>
> >
> > Here are the details from the branch, which should be very similar to
> > trunk.  I'm presenting the branch numbers because they contain
> > Andrew's upcoming relational query which significantly juices up the
> > results.
> Yea, I'm not surprised that the relational query helps significantly
> here.  And I'm not surprised that we can do much better with the
> backwards threader with a rewrite.
>
> Much of the ranger design was done with its use in the backwards jump
> threader in mind.  Backwards threading is, IMHO, a much better way to
> think about the problem.  The backwards threader also has a much
> stronger region copier -- so we don't have to live with the various
> limitations of the old jump threading approach.
>
>
>
> >
> > New threader:
> >     ethread:              65043   (+3.06%)
> >     dom:                  32450   (-13.3%)
> >     backwards threader:   72482   (+89.6%)
> >     vrp:                  40532   (-30.7%)
> >   Total threaded:        210507   (+6.70%)
> >
> > This means that the new code gets 89.6% more jump threading
> > opportunities than the code I want to replace.  In doing so, it
> > reduces the number of DOM threading opportunities by 13.3% and the
> > number of VRP threading opportunities by 30.7%.  The total
> > improvement across all jump threaders in the compiler is 6.70%.
> This looks good at first glance.  It's worth noting that the backwards
> threader runs before the others, so, yea, as it captures more stuff I
> would expect DOM/VRP to capture fewer things.    It would be interesting
> to know the breakdown of things caught by VRP1/VRP2 and how much of that
> is secondary opportunities that are only appearing because we've done a
> better job earlier.
>
> And just to be clear, I expect that we're going to leave some of those
> secondary opportunities on the table -- we just don't want it to be too
> many :-)  When I last looked at this my sense was wiring the backwards
> threader and ranger together should be enough to subsume VRP1/VRP2 jump
> threading.
>
> >
> > However, these are pessimistic numbers...
> >
> > I have noticed that some of the threading opportunities that DOM and
> > VRP now get are not because they're smarter, but because they're
> > picking up opportunities that the new code exposes.  I experimented
> > with running an iterative threader, and then seeing what VRP and DOM
> > could actually get.  This is too expensive to do in real life, but it
> > at least shows what the effect of the new code is on DOM/VRP's abilities:
> >
> > Iterative threader:
> >     ethread:              65043   (+3.06%)
> >     dom:                  31170   (-16.7%)
> >     thread:               86717   (+127%)
> >     vrp:                  33851   (-42.2%)
> >   Total threaded:        216781   (+9.90%)
> >
> > This means that the new code not only gets 127% more cases, but it
> > reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
> > respectively).   The end result is that we have the possibility of
> > getting almost 10% more jump threading opportunities in the entire
> > compilation run.
> >
> > (Note that the new code gets even more opportunities, but I'm only
> > reporting the profitable ones that made it all the way through to the
> > threader backend, and actually eliminated a branch.)
> Thanks for clarifying that.  It was one of the questions that first
> popped into my mind.
>
> >
> > The overall compilation hit from this work is currently 1.38% as
> > measured by callgrind.  We should be able to reduce this a bit, plus
> > we could get some of that back if we can replace the DOM and VRP
> > threaders (future work).
> Given how close we were to dropping the VRP threaders before, I would
> support dropping them at the same time.  That gives you a bit more
> compile-time budget.
>
> >
> > My proposed implementation should be able to get any threading
> > opportunity, and will get more as range-ops and ranger improve.
> >
> > I can go into the details if necessary, but the gist of it is that we
> > leverage the import facility in the ranger to only look up paths that
> > have a direct repercussion in the conditional being threaded, thus
> > reducing the search space.  This enhanced path discovery, plus an
> > engine to resolve conditionals based on knowledge from a CFG path, is
> > all that is needed to register new paths.  There is no limit to how
> > far back we look, though in practice, we stop looking once a path is
> > too expensive to continue the search in a given direction.
> Right.  That's one of the great things about Ranger -- it can
> dramatically reduce the search space.
>
> >
> > The solver API is simple:
> >
> > // This class is a thread path solver.  Given a set of BBs indicating
> > // a path through the CFG, range_in_path() will return the range
> > // of an SSA as if the BBs in the path would have been executed in
> > // order.
> > //
> > // Note that the blocks are in reverse order, thus the exit block is
> > // path[0].
> >
> > class thread_solver : gori_compute
> > {
> >
> > public:
> >   thread_solver (gimple_ranger &ranger);
> >   virtual ~thread_solver ();
> >   void set_path (const vec<basic_block> *, const bitmap_head *imports);
> >   void range_in_path (irange &, tree name);
> >   void range_in_path (irange &, gimple *);
> > ...
> > };
> >
> > Basically, as we're discovering paths, we ask the solver what the
> > value of the final conditional in a BB is in a given path.  If it
> > resolves, we register the path.
> Exactly.  Given a path, do we know enough to resolve the conditional at
> the end.  If so register the path as a potential jump threading opportunity.
>
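As a rough illustration of "given a path, can we resolve the conditional at the end", here is a toy sketch. It does not use the ranger, gori_compute or the proposed thread_solver; every name in it (toy_range, toy_block, toy_range_in_path, toy_resolve_less_than) is hypothetical. It assumes a single integer x, that each block on the path contributes one closed interval known to hold for x, and that the final conditional has the form "x < bound". It keeps the convention from the comment above that path[0] is the exit block.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <vector>

// Toy closed interval [lo, hi] for a single integer x.  This only
// stands in for a real range class; it is not GCC's irange.
struct toy_range
{
  int lo;
  int hi;
};

// Toy basic block: the interval known to hold for x once this block
// has been executed on the path (use [INT_MIN, INT_MAX] for "nothing").
struct toy_block
{
  toy_range known;
};

enum class toy_result { always_true, always_false, unknown };

// Range of x as if the blocks in PATH had been executed in order.
// PATH is stored in reverse order (path[0] is the exit block), so it is
// walked back to front.  For plain intersection the walking order does
// not change the result; it is kept only to mirror execution order.
toy_range toy_range_in_path (const std::vector<toy_block> &path)
{
  toy_range r = { INT_MIN, INT_MAX };
  for (auto it = path.rbegin (); it != path.rend (); ++it)
    {
      r.lo = std::max (r.lo, it->known.lo);
      r.hi = std::min (r.hi, it->known.hi);
    }
  return r;
}

// Try to resolve the final conditional "x < bound" from the path range.
toy_result toy_resolve_less_than (const toy_range &r, int bound)
{
  if (r.lo > r.hi)
    return toy_result::unknown;      // Empty range: infeasible path.
  if (r.hi < bound)
    return toy_result::always_true;
  if (r.lo >= bound)
    return toy_result::always_false;
  return toy_result::unknown;
}
```

In this toy model a path would be registered as a candidate only when the result is always_true or always_false; an unknown result means the path does not carry enough information to resolve the conditional.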
> >
> > A follow-up project would be to analyze what DOM/VRP are actually
> > getting that we don't, because in theory with an enhanced ranger, we
> > should be able to get everything they do (minus some float stuff, and
> > some CSE things DOM does).  However, IMO, this is good enough to at
> > least replace the current backwards threading code.
> I bet it's going to be tougher to remove DOM's threader.  It knows how
> to do things like resolve memory references using temporary equivalences
> and such.  But I bet it's enough to drop the VRP based threaders.

Yes.  In fact I am wondering if adding threading to the non-iterating FRE
would make it possible to drop DOM, replacing it with instances of FRE.

Richard.
