On Mon, Jun 14, 2021 at 12:02 AM Jeff Law via Gcc <gcc@gcc.gnu.org> wrote: > > > > On 6/9/2021 5:48 AM, Aldy Hernandez wrote: > > Hi Jeff. Hi folks. > > > > What started as a foray into severing the old (forward) threader's > > dependency on evrp, turned into a rewrite of the backwards threader > > code. I'd like to discuss the possibility of replacing the current > > backwards threader with a new one that gets far more threads and can > > potentially subsume all threaders in the future. > > > > I won't include code here, as it will just detract from the high level > > discussion. But if it helps, I could post what I have, which just > > needs some cleanups and porting to the latest trunk changes Andrew has > > made. > > > > Currently the backwards threader works by traversing DEF chains > > through PHIs leading to possible paths that start in a constant. When > > such a path is found, it is checked to see if it is profitable, and if > > so, the constant path is threaded. The current implementation is > > rather limited since backwards paths must end in a constant. For > > example, the backwards threader can't get any of the tests in > > gcc.dg/tree-ssa/ssa-thread-14.c: > > > > if (a && b) > > foo (); > > if (!b && c) > > bar (); > > > > etc. > Right. And these kinds of cases are particularly interesting to capture > -- not only do you remove the runtime test/compare, all the setup code > usually dies as well. I can't remember who, but someone added some bits > to detect these cases in DOM a while back and while the number of > additional jumps threaded wasn't great, the overall impact was much > better than we initially realized. Instead of allowign removal of a > single compare/branch, it typically allowed removal of a chain of > logicals that fed the conditional. > > > > > After my refactoring patches to the threading code, it is now possible > > to drop in an alternate implementation that shares the profitability > > code (is this path profitable?), the jump registry, and the actual > > jump threading code. I have leveraged this to write a ranger-based > > threader that gets every single thread the current code gets, plus > > 90-130% more. > Sweet. > > > > > Here are the details from the branch, which should be very similar to > > trunk. I'm presenting the branch numbers because they contain > > Andrew's upcoming relational query which significantly juices up the > > results. > Yea, I'm not surprised that the relational query helps significantly > here. And I'm not surprised that we can do much better with the > backwards threader with a rewrite. > > Much of the ranger design was with the idea behind using it in the > backwards jump threader in mind. Backwards threading is, IMHO, a much > better way to think about the problem. THe backwards threader also has > a much stronger region copier -- so we don't have to live with the > various limitations of the old jump threading approach. > > > > > > > New threader: > > ethread:65043 (+3.06%) > > dom:32450 (-13.3%) > > backwards threader:72482 (+89.6%) > > vrp:40532 (-30.7%) > > Total threaded: 210507 (+6.70%) > > > > This means that the new code gets 89.6% more jump threading > > opportunities than the code I want to replace. In doing so, it > > reduces the amount of DOM threading opportunities by 13.3% and by > > 30.7% from the VRP jump threader. The total improvement across the > > jump threading opportunities in the compiler is 6.70%. > This looks good at first glance. It's worth noting that the backwards > threader runs before the others, so, yea, as it captures more stuff I > would expect DOM/VRP to capture fewer things. It would be interesting > to know the breakdown of things caught by VRP1/VRP2 and how much of that > is secondary opportunities that are only appearing because we've done a > better job earlier. > > And just to be clear, I expect that we're going to leave some of those > secondary opportunities on the table -- we just don't want it to be too > many :-) When I last looked at this my sense was wiring the backwards > threader and ranger together should be enough to subsume VRP1/VRP2 jump > threading. > > > > > However, these are pessimistic numbers... > > > > I have noticed that some of the threading opportunities that DOM and > > VRP now get are not because they're smarter, but because they're > > picking up opportunities that the new code exposes. I experimented > > with running an iterative threader, and then seeing what VRP and DOM > > could actually get. This is too expensive to do in real life, but it > > at least shows what the effect of the new code is on DOM/VRP's abilities: > > > > Iterative threader: > > ethread:65043 (+3.06%) > > dom:31170 (-16.7%) > > thread:86717 (+127%) > > vrp:33851 (-42.2%) > > Total threaded: 216781 (+9.90%) > > > > This means that the new code not only gets 127% more cases, but it > > reduces the DOM and VRP opportunities considerably (16.7% and 42.2% > > respectively). The end result is that we have the possibility of > > getting almost 10% more jump threading opportunities in the entire > > compilation run. > > > > (Note that the new code gets even more opportunities, but I'm only > > reporting the profitable ones that made it all the way through to the > > threader backend, and actually eliminated a branch.) > Thanks for clarifying that. It was one of the questions that first > popped into my mind. > > > > > The overall compilation hit from this work is currently 1.38% as > > measured by callgrind. We should be able to reduce this a bit, plus > > we could get some of that back if we can replace the DOM and VRP > > threaders (future work). > Given how close we were to dropping the VRP threaders before, I would > support dropping them at the same time. That gives you a bit more > compile-time budget. > > > > > My proposed implementation should be able to get any threading > > opportunity, and will get more as range-ops and ranger improve. > > > > I can go into the details if necessary, but the gist of it is that we > > leverage the import facility in the ranger to only look up paths that > > have a direct repercussion in the conditional being threaded, thus > > reducing the search space. This enhanced path discovery, plus an > > engine to resolve conditionals based on knowledge from a CFG path, is > > all that is needed to register new paths. There is no limit to how > > far back we look, though in practice, we stop looking once a path is > > too expensive to continue the search in a given direction. > Right. That's one of the great things about Ranger -- it can > dramatically reduce the search space. > > > > > The solver API is simple: > > > > // This class is a thread path solver. Given a set of BBs indicating > > // a path through the CFG, range_in_path() will return the range > > // of an SSA as if the BBs in the path would have been executed in > > // order. > > // > > // Note that the blocks are in reverse order, thus the exit block is > > path[0]. > > > > class thread_solver : gori_compute > > { > > > > public: > > thread_solver (gimple_ranger &ranger); > > virtual ~thread_solver (); > > void set_path (const vec<basic_block> *, const bitmap_head *imports); > > void range_in_path (irange &, tree name); > > void range_in_path (irange &, gimple *); > > ... > > }; > > > > Basically, as we're discovering paths, we ask the solver what the > > value of the final conditional in a BB is in a given path. If it > > resolves, we register the path. > Exactly. Given a path, do we know enough to resolve the conditional at > the end. If so register the path as a potential jump threading opportunity. > > > > > A follow-up project would be to analyze what DOM/VRP are actually > > getting that we don't, because in theory with an enhanced ranger, we > > should be able to get everything they do (minus some float stuff, and > > some CSE things DOM does). However, IMO, this is good enough to at > > least replace the current backwards threading code. > I bet it's going to be tougher to remove DOM's threader. It knows how > to do thinks like resolve memory references using temporary equivalences > and such. But I bet it's enough to drop the VRP based threaders.
Yes. In fact I am wondering if adding threading to the not iterating FRE would make it possible to drop DOM, replacing it with instances of FRE. Richard.