On June 9, 2021 5:34:03 PM GMT+02:00, Aldy Hernandez <[email protected]> wrote:
>
>
>On 6/9/21 2:09 PM, Richard Biener wrote:
>> On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc <[email protected]> wrote:
>>>
>>> Hi Jeff. Hi folks.
>>>
>>> What started as a foray into severing the old (forward) threader's
>>> dependency on evrp, turned into a rewrite of the backwards threader
>>> code. I'd like to discuss the possibility of replacing the current
>>> backwards threader with a new one that gets far more threads and can
>>> potentially subsume all threaders in the future.
>>>
>>> I won't include code here, as it will just detract from the
>>> high-level discussion. But if it helps, I could post what I have,
>>> which just needs some cleanups and porting to the latest trunk
>>> changes Andrew has made.
>>>
>>> Currently the backwards threader works by traversing DEF chains
>>> through PHIs, leading to possible paths that start in a constant.
>>> When such a path is found, it is checked to see if it is profitable,
>>> and if so, the constant path is threaded. The current implementation
>>> is rather limited, since backwards paths must end in a constant. For
>>> example, the backwards threader can't get any of the tests in
>>> gcc.dg/tree-ssa/ssa-thread-14.c:
>>>
>>>   if (a && b)
>>>     foo ();
>>>   if (!b && c)
>>>     bar ();
>>>
>>> etc.
>>>
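For archive readers: the reason a path-sensitive solver can catch these is that on the path leaving the first condition's true arm, b is known nonzero, so the second test folds to false. A toy, hand-threaded C++ sketch of the transformation's semantics (illustration only, not GCC output; the bitmask encoding is mine so the two versions are easy to compare):

```cpp
// Toy model of the ssa-thread-14.c pattern above.  Instead of calling
// foo()/bar(), return a bitmask: bit 0 = foo would run, bit 1 = bar.

// Original control flow.
static int
original (int a, int b, int c)
{
  int calls = 0;
  if (a && b)
    calls |= 1;       // foo ()
  if (!b && c)
    calls |= 2;       // bar ()
  return calls;
}

// Hand-threaded version: on the path where (a && b) held, b is known
// nonzero, so "!b && c" must be false and the second test is skipped.
static int
threaded (int a, int b, int c)
{
  int calls = 0;
  if (a && b)
    {
      calls |= 1;     // foo ()
      return calls;   // threaded exit: bar () is unreachable here
    }
  if (!b && c)
    calls |= 2;       // bar ()
  return calls;
}
```

Exhaustively checking all eight input combinations shows the two are behaviorally identical; the threaded one just skips a dynamic branch.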
>>> After my refactoring patches to the threading code, it is now
>>> possible to drop in an alternate implementation that shares the
>>> profitability code (is this path profitable?), the jump registry,
>>> and the actual jump threading code. I have leveraged this to write
>>> a ranger-based threader that gets every single thread the current
>>> code gets, plus 90-130% more.
>>>
>>> Here are the details from the branch, which should be very similar
>>> to trunk. I'm presenting the branch numbers because they contain
>>> Andrew's upcoming relational query, which significantly juices up
>>> the results.
>>>
>>> New threader:
>>> ethread:65043 (+3.06%)
>>> dom:32450 (-13.3%)
>>> backwards threader:72482 (+89.6%)
>>> vrp:40532 (-30.7%)
>>> Total threaded: 210507 (+6.70%)
>>>
>>> This means that the new code gets 89.6% more jump threading
>>> opportunities than the code I want to replace. In doing so, it
>>> reduces the number of DOM threading opportunities by 13.3%, and
>>> those of the VRP jump threader by 30.7%. The total improvement
>>> across the jump threading opportunities in the compiler is 6.70%.
>>>
>>> However, these are pessimistic numbers...
>>>
>>> I have noticed that some of the threading opportunities that DOM
>>> and VRP now get are not because they're smarter, but because
>>> they're picking up opportunities that the new code exposes. I
>>> experimented with running an iterative threader, and then seeing
>>> what VRP and DOM could actually get. This is too expensive to do in
>>> real life, but it at least shows what the effect of the new code is
>>> on DOM/VRP's abilities:
>>>
>>> Iterative threader:
>>> ethread:65043 (+3.06%)
>>> dom:31170 (-16.7%)
>>> thread:86717 (+127%)
>>> vrp:33851 (-42.2%)
>>> Total threaded: 216781 (+9.90%)
>>>
>>> This means that the new code not only gets 127% more cases, but it
>>> reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
>>> respectively). The end result is that we have the possibility of
>>> getting almost 10% more jump threading opportunities in the entire
>>> compilation run.
>>
>> Yeah, DOM once was iterating ...
>>
>> You probably have noticed that we have very many (way too many)
>> 'thread' passes, often in close succession with each other or
>> DOM or VRP. So in the above numbers I wonder if you can break
>> down the numbers individually for the actual passes (in their order)?
>
>Sure, I can do that. Let me whip up the old branch and gather some
>info.
>
>>
>>> (Note that the new code gets even more opportunities, but I'm only
>>> reporting the profitable ones that made it all the way through to
>>> the threader backend and actually eliminated a branch.)
>>>
>>> The overall compile-time hit from this work is currently 1.38%, as
>>> measured by callgrind. We should be able to reduce this a bit, plus
>>> we could get some of that back if we can replace the DOM and VRP
>>> threaders (future work).
>>>
>>> My proposed implementation should be able to get any threading
>>> opportunity, and will get more as range-ops and ranger improve.
>>>
>>> I can go into the details if necessary, but the gist of it is that
>>> we leverage the import facility in the ranger to only look up paths
>>> that have a direct repercussion on the conditional being threaded,
>>> thus reducing the search space. This enhanced path discovery, plus
>>> an engine to resolve conditionals based on knowledge from a CFG
>>> path, is all that is needed to register new paths. There is no
>>> limit to how far back we look, though in practice, we stop looking
>>> once a path is too expensive to continue the search in a given
>>> direction.
>>>
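If I'm reading the import idea right, the pruning amounts to computing the transitive set of SSA names that can affect the conditional, and refusing to walk DEF chains outside that set. A hypothetical, self-contained miniature (names as strings, defs as an operand map; every identifier here is mine, not ranger's, which tracks this per block with bitmaps over SSA versions):

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model of "imports": the names that feed a conditional's operand.
// defs maps a name to the operands of its defining statement.
using def_map = std::map<std::string, std::vector<std::string>>;

static std::set<std::string>
imports_of (const std::string &cond_op, const def_map &defs)
{
  std::set<std::string> imports;
  std::vector<std::string> worklist = { cond_op };
  while (!worklist.empty ())
    {
      std::string name = worklist.back ();
      worklist.pop_back ();
      if (!imports.insert (name).second)
        continue;                       // already visited
      auto it = defs.find (name);
      if (it != defs.end ())
        for (const std::string &op : it->second)
          worklist.push_back (op);      // recurse through the DEF chain
    }
  return imports;
}
```

So for "d = a + b; cond = d < c;" the imports of cond are {cond, d, a, b, c}; a PHI argument outside that set can never change the conditional's outcome, and the backwards search need not follow it.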
>>> The solver API is simple:
>>>
>>> // This class is a thread path solver.  Given a set of BBs
>>> // indicating a path through the CFG, range_in_path() will return
>>> // the range of an SSA name as if the BBs in the path had been
>>> // executed in order.
>>> //
>>> // Note that the blocks are in reverse order, thus the exit block
>>> // is path[0].
>>>
>>> class thread_solver : gori_compute
>>> {
>>> public:
>>>   thread_solver (gimple_ranger &ranger);
>>>   virtual ~thread_solver ();
>>>   void set_path (const vec<basic_block> *, const bitmap_head *imports);
>>>   void range_in_path (irange &, tree name);
>>>   void range_in_path (irange &, gimple *);
>>>   ...
>>> };
>>>
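To make the contract concrete, here is a guess at the kind of computation range_in_path() performs, reduced to a toy: each block contributes the constraint its outgoing edge on the path imposes, and the constraints are intersected (ignoring statement side effects, and the reverse block ordering, which doesn't matter for a pure intersection). toy_range, toy_block, and edge_constraints are all invented stand-ins; the real types are irange and vec<basic_block>:

```cpp
#include <algorithm>
#include <climits>
#include <map>
#include <string>
#include <vector>

// Stand-in for irange: a single [lo, hi] interval.
struct toy_range { long lo, hi; };

static toy_range
intersect (toy_range a, toy_range b)
{
  return { std::max (a.lo, b.lo), std::min (a.hi, b.hi) };
}

// Each "block" records the constraints that taking its outgoing edge
// on the path imposes on names (e.g. the true edge of "if (b)" pins
// an unsigned b to [1, +INF]).
struct toy_block
{
  std::map<std::string, toy_range> edge_constraints;
};

// Range of NAME as if the blocks on PATH had executed in order:
// start from VARYING and intersect every constraint along the path.
static toy_range
range_in_path (const std::string &name, const std::vector<toy_block> &path)
{
  toy_range r = { LONG_MIN, LONG_MAX };   // VARYING
  for (const toy_block &bb : path)
    {
      auto it = bb.edge_constraints.find (name);
      if (it != bb.edge_constraints.end ())
        r = intersect (r, it->second);
    }
  return r;
}
```

With something like this in hand, the driver's question "does the final conditional resolve on this path?" becomes: compute each operand's range along the path, fold the condition, and register the path if the result is a singleton true/false.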
>>> Basically, as we're discovering paths, we ask the solver what the
>>> value of the final conditional in a BB is in a given path. If it
>>> resolves, we register the path.
>>>
>>> A follow-up project would be to analyze what DOM/VRP are actually
>>> getting that we don't, because in theory, with an enhanced ranger,
>>> we should be able to get everything they do (minus some float
>>> stuff, and some CSE things DOM does). However, IMO, this is good
>>> enough to at least replace the current backwards threading code.
>>>
>>> My suggestion would be to keep both implementations, defaulting to
>>> the ranger-based one, and running the old code immediately after,
>>> trapping if it can find any threading opportunities.
>>
>> But due to iteration uncovering new opportunities this will
>> inevitably break, no?
>
>No, actually, because I'm comparing current and new backwards threader
>behavior on the same IL, something I can't do with VRP because the IL
>is slightly different (VRP asserts).
>
>So I can compare apples to apples in the backwards threader code. What
>I do is run the new code first, and then run the old code on the same
>IL before the threader has altered the CFG, while asserting that the
>old code cannot register any "new" paths not already present in the
>registry.
>
>I have tested this assertion throughout 200+ .ii files from a
>bootstrap, and there has never been a case where the old code can get
>something we can't.
Ah, that's nice!
>>
>>> After a few weeks, we could
>>> kill the old code.
>>
>> Note that for analyzing threadings done, apart from looking at
>> overall numbers, the statistics infrastructure can be useful;
>> likewise the opt-info one, where you can diff stats based on file or
>> function (statistics) or even the location of a participating jump
>> (opt-info).
>>
>> If you are re-using tree-ssa-thread{edge,update}.{c,h} anyway you
>> probably only have to amend one or two places. I'm personally
>> breaking things down to file/function via statistics to spot gross
>> differences more locally.
>
>I am indeed re-using all of tree-ssa-thread{edge,update}. That is
>unchanged. I've been using the overall statistics plus an awk script
>to collate it all. But thanks for the opt-info tip. I didn't know
>about that.
>
>>
>> IMHO removing threading from VRP (as a step to make "VRP"
>> into another EVRP run) should be part of the initial transition,
>> there's always a thread pass nearby. Performing threading from
>> EVRP itself might be another option to evaluate. Trimming down
>> the number of (now backwards-)threaders would be another goal.
>
>Agreed. I will have to sit down and investigate what VRP is getting.
>Last I peeked, it was either stuff range-ops could be taught, or
>unconditional jumps to unconditional jumps that could be easily
>handled. However, I think we can prove that the current backwards
>threader code cannot get anything in the presence of the new code,
>and could be easily replaced before concentrating on DOM/VRP.
>
>That being said, figuring out exactly the discrepancies with DOM/VRP
>is on my short-term radar ;-).
>
>Aldy