On June 9, 2021 5:34:03 PM GMT+02:00, Aldy Hernandez <al...@redhat.com> wrote:
>
>On 6/9/21 2:09 PM, Richard Biener wrote:
>> On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc <gcc@gcc.gnu.org> wrote:
>>>
>>> Hi Jeff.  Hi folks.
>>>
>>> What started as a foray into severing the old (forward) threader's
>>> dependency on evrp turned into a rewrite of the backwards threader
>>> code.  I'd like to discuss the possibility of replacing the current
>>> backwards threader with a new one that gets far more threads and can
>>> potentially subsume all threaders in the future.
>>>
>>> I won't include code here, as it would just detract from the
>>> high-level discussion.  But if it helps, I could post what I have,
>>> which just needs some cleanups and porting to the latest trunk
>>> changes Andrew has made.
>>>
>>> Currently the backwards threader works by traversing DEF chains
>>> through PHIs, leading to possible paths that start in a constant.
>>> When such a path is found, it is checked to see if it is profitable,
>>> and if so, the constant path is threaded.  The current implementation
>>> is rather limited, since backwards paths must end in a constant.
>>> For example, the backwards threader can't get any of the tests in
>>> gcc.dg/tree-ssa/ssa-thread-14.c:
>>>
>>>   if (a && b)
>>>     foo ();
>>>   if (!b && c)
>>>     bar ();
>>>
>>> etc.
>>>
>>> After my refactoring patches to the threading code, it is now
>>> possible to drop in an alternate implementation that shares the
>>> profitability code (is this path profitable?), the jump registry,
>>> and the actual jump threading code.  I have leveraged this to write
>>> a ranger-based threader that gets every single thread the current
>>> code gets, plus 90-130% more.
>>>
>>> Here are the details from the branch, which should be very similar
>>> to trunk.  I'm presenting the branch numbers because they contain
>>> Andrew's upcoming relational query, which significantly juices up
>>> the results.
>>>
>>> New threader:
>>>   ethread:            65043   (+3.06%)
>>>   dom:                32450   (-13.3%)
>>>   backwards threader: 72482   (+89.6%)
>>>   vrp:                40532   (-30.7%)
>>>   Total threaded:    210507   (+6.70%)
>>>
>>> This means that the new code gets 89.6% more jump threading
>>> opportunities than the code I want to replace.  In doing so, it
>>> reduces the amount of DOM threading opportunities by 13.3%, and
>>> those of the VRP jump threader by 30.7%.  The total improvement
>>> across all jump threading opportunities in the compiler is 6.70%.
>>>
>>> However, these are pessimistic numbers...
>>>
>>> I have noticed that some of the threading opportunities that DOM
>>> and VRP now get are not because they're smarter, but because
>>> they're picking up opportunities that the new code exposes.  I
>>> experimented with running an iterative threader, and then seeing
>>> what VRP and DOM could actually get.  This is too expensive to do
>>> in real life, but it at least shows what the effect of the new
>>> code is on DOM/VRP's abilities:
>>>
>>> Iterative threader:
>>>   ethread:            65043   (+3.06%)
>>>   dom:                31170   (-16.7%)
>>>   thread:             86717   (+127%)
>>>   vrp:                33851   (-42.2%)
>>>   Total threaded:    216781   (+9.90%)
>>>
>>> This means that the new code not only gets 127% more cases, but it
>>> reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
>>> respectively).  The end result is that we have the possibility of
>>> getting almost 10% more jump threading opportunities in the entire
>>> compilation run.
>>
>> Yeah, DOM once was iterating ...
>>
>> You probably have noticed that we have very many (way too many)
>> 'thread' passes, often in close succession with each other or with
>> DOM or VRP.  So in the above numbers I wonder if you can break
>> down the numbers individually for the actual passes (in their order)?
>
>Sure, I can do that.  Let me whip up the old branch and gather some
>info.
>
>>
>>> (Note that the new code gets even more opportunities, but I'm only
>>> reporting the profitable ones that made it all the way through to
>>> the threader backend and actually eliminated a branch.)
>>>
>>> The overall compilation hit from this work is currently 1.38% as
>>> measured by callgrind.  We should be able to reduce this a bit,
>>> plus we could get some of that back if we can replace the DOM and
>>> VRP threaders (future work).
>>>
>>> My proposed implementation should be able to get any threading
>>> opportunity, and will get more as range-ops and the ranger improve.
>>>
>>> I can go into the details if necessary, but the gist of it is that
>>> we leverage the import facility in the ranger to only look up paths
>>> that have a direct repercussion on the conditional being threaded,
>>> thus reducing the search space.  This enhanced path discovery, plus
>>> an engine to resolve conditionals based on knowledge from a CFG
>>> path, is all that is needed to register new paths.  There is no
>>> limit to how far back we look, though in practice we stop looking
>>> once a path is too expensive to continue the search in a given
>>> direction.
>>>
>>> The solver API is simple:
>>>
>>> // This class is a thread path solver.  Given a set of BBs
>>> // indicating a path through the CFG, range_in_path() will return
>>> // the range of an SSA as if the BBs in the path would have been
>>> // executed in order.
>>> //
>>> // Note that the blocks are in reverse order, thus the exit block
>>> // is path[0].
>>>
>>> class thread_solver : gori_compute
>>> {
>>>
>>> public:
>>>   thread_solver (gimple_ranger &ranger);
>>>   virtual ~thread_solver ();
>>>   void set_path (const vec<basic_block> *, const bitmap_head *imports);
>>>   void range_in_path (irange &, tree name);
>>>   void range_in_path (irange &, gimple *);
>>>   ...
>>> };
>>>
>>> Basically, as we're discovering paths, we ask the solver what the
>>> value of the final conditional in a BB is in a given path.
>>> If it resolves, we register the path.
>>>
>>> A follow-up project would be to analyze what DOM/VRP are actually
>>> getting that we don't, because in theory, with an enhanced ranger,
>>> we should be able to get everything they do (minus some float
>>> stuff, and some CSE things DOM does).  However, IMO, this is good
>>> enough to at least replace the current backwards threading code.
>>>
>>> My suggestion would be to keep both implementations, defaulting to
>>> the ranger-based one, and running the old code immediately after,
>>> trapping if it can find any threading opportunities.
>>
>> But due to iteration uncovering new opportunities this will
>> inevitably break, no?
>
>No, actually, because I'm comparing current and new backwards threader
>behavior on the same IL, something I can't do with VRP because the IL
>is slightly different (VRP asserts).
>
>So I can compare apples to apples in the backwards threader code.
>What I do is run the new code first, and then run the old code on the
>same IL before the threader has altered the CFG, while asserting that
>the old code cannot register any "new" paths not already present in
>the registry.
>
>I have tested this assertion throughout 200+ .ii files from a
>bootstrap, and there has never been a case where the old code can get
>something we can't.
Ah, that's nice!

>>
>>> After a few weeks, we could
>>> kill the old code.
>>
>> Note that for analyzing the threadings done, apart from looking at
>> overall numbers, the statistics infrastructure can be useful;
>> likewise the opt-info one, where you can diff stats based on file
>> or function (statistics) or even on the location of a participating
>> jump (opt-info).
>>
>> If you are re-using tree-ssa-thread{edge,update}.{c,h} anyway, you
>> probably only have to amend one or two places.  I'm personally
>> breaking things down to file/function via statistics to spot gross
>> differences more locally.
>
>I am indeed re-using all of tree-ssa-thread{edge,update}.  That is
>unchanged.  I've been using the overall statistics plus an awk script
>to collate it all.  But thanks for the opt-info tip.  I didn't know
>about that.
>
>>
>> IMHO removing threading from VRP (as a step to making "VRP" into
>> another EVRP run) should be part of the initial transition; there's
>> always a thread pass nearby.  Performing threading from EVRP itself
>> might be another option to evaluate.  Trimming down the number of
>> (now backwards-)threaders would be another goal.
>
>Agreed.  I will have to sit down and investigate what VRP is getting.
>Last I peeked, it was either stuff range-ops could be taught, or
>unconditional jumps to unconditional jumps that could be easily
>handled.  However, I think we can prove that the current backwards
>threader code cannot get anything in the presence of the new code,
>and so it could be easily replaced before concentrating on DOM/VRP.
>
>That being said, figuring out exactly the discrepancies with DOM/VRP
>is on my short-term radar ;-).
>
>Aldy