On June 9, 2021 5:34:03 PM GMT+02:00, Aldy Hernandez <[email protected]> wrote:
>
>
>On 6/9/21 2:09 PM, Richard Biener wrote:
>> On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc <[email protected]> wrote:
>>>
>>> Hi Jeff. Hi folks.
>>>
>>> What started as a foray into severing the old (forward) threader's
>>> dependency on evrp, turned into a rewrite of the backwards threader
>>> code. I'd like to discuss the possibility of replacing the current
>>> backwards threader with a new one that gets far more threads and can
>>> potentially subsume all threaders in the future.
>>>
>>> I won't include code here, as it will just detract from the
>>> high-level discussion. But if it helps, I could post what I have,
>>> which just needs some cleanups and porting to the latest trunk
>>> changes Andrew has made.
>>>
>>> Currently the backwards threader works by traversing DEF chains
>>> through PHIs, leading to possible paths that start in a constant.
>>> When such a path is found, it is checked to see if it is profitable,
>>> and if so, the constant path is threaded. The current implementation
>>> is rather limited, since backwards paths must end in a constant. For
>>> example, the backwards threader can't get any of the tests in
>>> gcc.dg/tree-ssa/ssa-thread-14.c:
>>>
>>>   if (a && b)
>>>     foo ();
>>>   if (!b && c)
>>>     bar ();
>>>
>>> etc.
>>>
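For archive readers: the reason a path-sensitive solver can catch these is that on the path leaving the first condition's true arm, b is known nonzero, so the second test folds to false. A toy, hand-threaded C++ sketch of the transformation's semantics (illustration only, not GCC output; the bitmask encoding is mine so the two versions are easy to compare):

```cpp
// Toy model of the ssa-thread-14.c pattern above.  Instead of calling
// foo()/bar(), return a bitmask: bit 0 = foo would run, bit 1 = bar.

// Original control flow.
static int
original (int a, int b, int c)
{
  int calls = 0;
  if (a && b)
    calls |= 1;       // foo ()
  if (!b && c)
    calls |= 2;       // bar ()
  return calls;
}

// Hand-threaded version: on the path where (a && b) held, b is known
// nonzero, so "!b && c" must be false and the second test is skipped.
static int
threaded (int a, int b, int c)
{
  int calls = 0;
  if (a && b)
    {
      calls |= 1;     // foo ()
      return calls;   // threaded exit: bar () is unreachable here
    }
  if (!b && c)
    calls |= 2;       // bar ()
  return calls;
}
```

Exhaustively checking all eight input combinations shows the two are behaviorally identical; the threaded one just skips a dynamic branch.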
>>> After my refactoring patches to the threading code, it is now
>>> possible to drop in an alternate implementation that shares the
>>> profitability code (is this path profitable?), the jump registry,
>>> and the actual jump threading code. I have leveraged this to write
>>> a ranger-based threader that gets every single thread the current
>>> code gets, plus 90-130% more.
>>>
>>> Here are the details from the branch, which should be very similar
>>> to trunk. I'm presenting the branch numbers because they contain
>>> Andrew's upcoming relational query, which significantly juices up
>>> the results.
>>>
>>> New threader:
>>> ethread:65043 (+3.06%)
>>> dom:32450 (-13.3%)
>>> backwards threader:72482 (+89.6%)
>>> vrp:40532 (-30.7%)
>>> Total threaded: 210507 (+6.70%)
>>>
>>> This means that the new code gets 89.6% more jump threading
>>> opportunities than the code I want to replace. In doing so, it
>>> reduces the number of DOM threading opportunities by 13.3%, and
>>> those of the VRP jump threader by 30.7%. The total improvement
>>> across the jump threading opportunities in the compiler is 6.70%.
>>>
>>> However, these are pessimistic numbers...
>>>
>>> I have noticed that some of the threading opportunities that DOM
>>> and VRP now get are not because they're smarter, but because
>>> they're picking up opportunities that the new code exposes. I
>>> experimented with running an iterative threader, and then seeing
>>> what VRP and DOM could actually get. This is too expensive to do in
>>> real life, but it at least shows what the effect of the new code is
>>> on DOM/VRP's abilities:
>>>
>>> Iterative threader:
>>> ethread:65043 (+3.06%)
>>> dom:31170 (-16.7%)
>>> thread:86717 (+127%)
>>> vrp:33851 (-42.2%)
>>> Total threaded: 216781 (+9.90%)
>>>
>>> This means that the new code not only gets 127% more cases, but it
>>> reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
>>> respectively). The end result is that we have the possibility of
>>> getting almost 10% more jump threading opportunities in the entire
>>> compilation run.
>>
>> Yeah, DOM once was iterating ...
>>
>> You probably have noticed that we have very many (way too many)
>> 'thread' passes, often in close succession with each other or
>> DOM or VRP. So in the above numbers I wonder if you can break
>> down the numbers individually for the actual passes (in their order)?
>
>Sure, I can do that. Let me whip up the old branch and gather some
>info.
>
>>
>>> (Note that the new code gets even more opportunities, but I'm only
>>> reporting the profitable ones that made it all the way through to
>>> the threader backend and actually eliminated a branch.)
>>>
>>> The overall compile-time hit from this work is currently 1.38%, as
>>> measured by callgrind. We should be able to reduce this a bit, plus
>>> we could get some of that back if we can replace the DOM and VRP
>>> threaders (future work).
>>>
>>> My proposed implementation should be able to get any threading
>>> opportunity, and will get more as range-ops and ranger improve.
>>>
>>> I can go into the details if necessary, but the gist of it is that
>>> we leverage the import facility in the ranger to only look up paths
>>> that have a direct repercussion on the conditional being threaded,
>>> thus reducing the search space. This enhanced path discovery, plus
>>> an engine to resolve conditionals based on knowledge from a CFG
>>> path, is all that is needed to register new paths. There is no
>>> limit to how far back we look, though in practice, we stop looking
>>> once a path is too expensive to continue the search in a given
>>> direction.
>>>
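If I'm reading the import idea right, the pruning amounts to computing the transitive set of SSA names that can affect the conditional, and refusing to walk DEF chains outside that set. A hypothetical, self-contained miniature (names as strings, defs as an operand map; every identifier here is mine, not ranger's, which tracks this per block with bitmaps over SSA versions):

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model of "imports": the names that feed a conditional's operand.
// defs maps a name to the operands of its defining statement.
using def_map = std::map<std::string, std::vector<std::string>>;

static std::set<std::string>
imports_of (const std::string &cond_op, const def_map &defs)
{
  std::set<std::string> imports;
  std::vector<std::string> worklist = { cond_op };
  while (!worklist.empty ())
    {
      std::string name = worklist.back ();
      worklist.pop_back ();
      if (!imports.insert (name).second)
        continue;                       // already visited
      auto it = defs.find (name);
      if (it != defs.end ())
        for (const std::string &op : it->second)
          worklist.push_back (op);      // recurse through the DEF chain
    }
  return imports;
}
```

So for "d = a + b; cond = d < c;" the imports of cond are {cond, d, a, b, c}; a PHI argument outside that set can never change the conditional's outcome, and the backwards search need not follow it.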
>>> The solver API is simple:
>>>
>>> // This class is a thread path solver.  Given a set of BBs
>>> // indicating a path through the CFG, range_in_path() will return
>>> // the range of an SSA name as if the BBs in the path had been
>>> // executed in order.
>>> //
>>> // Note that the blocks are in reverse order, thus the exit block
>>> // is path[0].
>>>
>>> class thread_solver : gori_compute
>>> {
>>> public:
>>>   thread_solver (gimple_ranger &ranger);
>>>   virtual ~thread_solver ();
>>>   void set_path (const vec<basic_block> *, const bitmap_head *imports);
>>>   void range_in_path (irange &, tree name);
>>>   void range_in_path (irange &, gimple *);
>>>   ...
>>> };
>>>
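To make the contract concrete, here is a guess at the kind of computation range_in_path() performs, reduced to a toy: each block contributes the constraint its outgoing edge on the path imposes, and the constraints are intersected (ignoring statement side effects, and the reverse block ordering, which doesn't matter for a pure intersection). toy_range, toy_block, and edge_constraints are all invented stand-ins; the real types are irange and vec<basic_block>:

```cpp
#include <algorithm>
#include <climits>
#include <map>
#include <string>
#include <vector>

// Stand-in for irange: a single [lo, hi] interval.
struct toy_range { long lo, hi; };

static toy_range
intersect (toy_range a, toy_range b)
{
  return { std::max (a.lo, b.lo), std::min (a.hi, b.hi) };
}

// Each "block" records the constraints that taking its outgoing edge
// on the path imposes on names (e.g. the true edge of "if (b)" pins
// an unsigned b to [1, +INF]).
struct toy_block
{
  std::map<std::string, toy_range> edge_constraints;
};

// Range of NAME as if the blocks on PATH had executed in order:
// start from VARYING and intersect every constraint along the path.
static toy_range
range_in_path (const std::string &name, const std::vector<toy_block> &path)
{
  toy_range r = { LONG_MIN, LONG_MAX };   // VARYING
  for (const toy_block &bb : path)
    {
      auto it = bb.edge_constraints.find (name);
      if (it != bb.edge_constraints.end ())
        r = intersect (r, it->second);
    }
  return r;
}
```

With something like this in hand, the driver's question "does the final conditional resolve on this path?" becomes: compute each operand's range along the path, fold the condition, and register the path if the result is a singleton true/false.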
>>> Basically, as we're discovering paths, we ask the solver what the
>>> value of the final conditional in a BB is in a given path. If it
>>> resolves, we register the path.
>>>
>>> A follow-up project would be to analyze what DOM/VRP are actually
>>> getting that we don't, because in theory, with an enhanced ranger,
>>> we should be able to get everything they do (minus some float
>>> stuff, and some CSE things DOM does). However, IMO, this is good
>>> enough to at least replace the current backwards threading code.
>>>
>>> My suggestion would be to keep both implementations, defaulting to
>>> the ranger-based one, and running the old code immediately after,
>>> trapping if it can find any threading opportunities.
>>
>> But due to iteration uncovering new opportunities this will
>> inevitably break, no?
>
>No, actually, because I'm comparing current and new backwards threader
>behavior on the same IL, something I can't do with VRP because the IL
>is slightly different (VRP asserts).
>
>So I can compare apples to apples in the backwards threader code. What
>I do is run the new code first, and then run the old code on the same
>IL before the threader has altered the CFG, while asserting that the
>old code cannot register any "new" paths not already present in the
>registry.
>
>I have tested this assertion throughout 200+ .ii files from a
>bootstrap, and there has never been a case where the old code can get
>something we can't.
Ah, that's nice!
>>
>>> After a few weeks, we could
>>> kill the old code.
>>
>> Note that for analyzing threadings done, apart from looking at
>> overall numbers, the statistics infrastructure can be useful;
>> likewise the opt-info one, where you can diff stats based on file or
>> function (statistics) or even the location of a participating jump
>> (opt-info).
>>
>> If you are re-using tree-ssa-thread{edge,update}.{c,h} anyway you
>> probably only have to amend one or two places. I'm personally
>> breaking things down to file/function via statistics to spot gross
>> differences more locally.
>
>I am indeed re-using all of tree-ssa-thread{edge,update}. That is
>unchanged. I've been using the overall statistics plus an awk script
>to collate it all. But thanks for the opt-info tip. I didn't know
>about that.
>
>>
>> IMHO removing threading from VRP (as a step to make "VRP"
>> into another EVRP run) should be part of the initial transition,
>> there's always a thread pass nearby. Performing threading from
>> EVRP itself might be another option to evaluate. Trimming down
>> the number of (now backwards-)threaders would be another goal.
>
>Agreed. I will have to sit down and investigate what VRP is getting.
>Last I peeked, it was either stuff range-ops could be taught, or
>unconditional jumps to unconditional jumps that could be easily
>handled. However, I think we can prove that the current backwards
>threader code cannot get anything in the presence of the new code,
>and could be easily replaced before concentrating on DOM/VRP.
>
>That being said, figuring out exactly the discrepancies with DOM/VRP
>is on my short-term radar ;-).
>
>Aldy