Re: replacing the backwards threader and more

Jeff Law via Gcc Sun, 13 Jun 2021 15:01:43 -0700



On 6/9/2021 5:48 AM, Aldy Hernandez wrote:

Hi Jeff.  Hi folks.
What started as a foray into severing the old (forward) threader's dependency on evrp turned into a rewrite of the backwards threader code. I'd like to discuss the possibility of replacing the current backwards threader with a new one that gets far more threads and can potentially subsume all threaders in the future.
I won't include code here, as it will just detract from the high-level discussion. But if it helps, I could post what I have, which just needs some cleanups and porting to the latest trunk changes Andrew has made.
Currently the backwards threader works by traversing DEF chains through PHIs leading to possible paths that start in a constant. When such a path is found, it is checked to see if it is profitable, and if so, the constant path is threaded. The current implementation is rather limited since backwards paths must end in a constant. For example, the backwards threader can't get any of the tests in gcc.dg/tree-ssa/ssa-thread-14.c:
  if (a && b)
    foo ();
  if (!b && c)
    bar ();

etc.
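
To make the missed opportunity concrete, here is a minimal hand-written sketch of what threading buys on the first pattern above. It is not GCC output: foo and bar are replaced by hypothetical logging stubs, and threaded () is a manual restructuring that assumes a, b, and c are side-effect-free booleans.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for foo () and bar (): they only record the call.
static void foo (std::string &log) { log += 'f'; }
static void bar (std::string &log) { log += 'b'; }

// The original shape: two back-to-back conditionals that both test b.
std::string original (bool a, bool b, bool c)
{
  std::string log;
  if (a && b)
    foo (log);
  if (!b && c)
    bar (log);
  return log;
}

// Hand-threaded equivalent: wherever the first conditional has already
// tested b, the second test of b is bypassed.
std::string threaded (bool a, bool b, bool c)
{
  std::string log;
  if (a)
    {
      if (b)
        {
          foo (log);
          return log;   // b is true, so "!b && c" is known false: skip it.
        }
      // b is false, so "!b" is known true: jump straight to the test of c.
      if (c)
        bar (log);
      return log;
    }
  // a was false, so b was never tested: the second conditional runs in full.
  if (!b && c)
    bar (log);
  return log;
}

// Exhaustively check that both versions make the same calls.
void check_equivalence ()
{
  for (int bits = 0; bits < 8; ++bits)
    {
      bool a = bits & 1, b = bits & 2, c = bits & 4;
      assert (original (a, b, c) == threaded (a, b, c));
    }
}
```

On the two paths that pass through the test of b in the first conditional, the value of b is already known when the second conditional is reached, so the compare on b disappears from those paths.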

Right. And these kinds of cases are particularly interesting to capture -- not only do you remove the runtime test/compare, all the setup code usually dies as well. I can't remember who, but someone added some bits to detect these cases in DOM a while back, and while the number of additional jumps threaded wasn't great, the overall impact was much better than we initially realized. Instead of allowing removal of a single compare/branch, it typically allowed removal of a chain of logicals that fed the conditional.

After my refactoring patches to the threading code, it is now possible to drop in an alternate implementation that shares the profitability code (is this path profitable?), the jump registry, and the actual jump threading code. I have leveraged this to write a ranger-based threader that gets every single thread the current code gets, plus 90-130% more.

Sweet.

Here are the details from the branch, which should be very similar to trunk. I'm presenting the branch numbers because they contain Andrew's upcoming relational query which significantly juices up the results.

Yea, I'm not surprised that the relational query helps significantly here. And I'm not surprised that we can do much better with the backwards threader with a rewrite.

Much of the ranger design was done with its use in the backwards jump threader in mind. Backwards threading is, IMHO, a much better way to think about the problem. The backwards threader also has a much stronger region copier -- so we don't have to live with the various limitations of the old jump threading approach.

New threader:
  ethread:              65043  (+3.06%)
  dom:                  32450  (-13.3%)
  backwards threader:   72482  (+89.6%)
  vrp:                  40532  (-30.7%)
  Total threaded:      210507  (+6.70%)
This means that the new code gets 89.6% more jump threading opportunities than the code I want to replace. In doing so, it reduces the number of DOM threading opportunities by 13.3% and the VRP jump threader's by 30.7%. The total improvement across the jump threading opportunities in the compiler is 6.70%.
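
Since each percentage is relative to the existing implementation, the per-pass baselines can be backed out of the "New threader" figures as a sanity check. The sketch below is only arithmetic on the reported numbers; implied_baseline and summed_pass_baselines are hypothetical helper names, and the results are approximate because the percentages are rounded.

```cpp
#include <cassert>
#include <cmath>

// Count implied for the existing implementation, given the new count and
// the reported change in percent: new = old * (1 + pct / 100).
double implied_baseline (double new_count, double pct_change)
{
  assert (pct_change > -100.0);   // a -100% change has no finite baseline
  return new_count / (1.0 + pct_change / 100.0);
}

// Sum of the per-pass baselines for the "New threader" figures.
double summed_pass_baselines ()
{
  return implied_baseline (65043, 3.06)     // ethread
         + implied_baseline (32450, -13.3)  // dom
         + implied_baseline (72482, 89.6)   // backwards threader
         + implied_baseline (40532, -30.7); // vrp
}
```

The backwards threader figure works out to a baseline of roughly 38,200 threads, and the four per-pass baselines sum to within a fraction of a percent of the baseline implied by the reported total.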

This looks good at first glance. It's worth noting that the backwards threader runs before the others, so, yea, as it captures more stuff I would expect DOM/VRP to capture fewer things. It would be interesting to know the breakdown of things caught by VRP1/VRP2 and how much of that is secondary opportunities that are only appearing because we've done a better job earlier.

And just to be clear, I expect that we're going to leave some of those secondary opportunities on the table -- we just don't want it to be too many :-) When I last looked at this, my sense was that wiring the backwards threader and ranger together should be enough to subsume VRP1/VRP2 jump threading.

However, these are pessimistic numbers...
I have noticed that some of the threading opportunities that DOM and VRP now get are not because they're smarter, but because they're picking up opportunities that the new code exposes. I experimented with running an iterative threader, and then seeing what VRP and DOM could actually get. This is too expensive to do in real life, but it at least shows what the effect of the new code is on DOM/VRP's abilities:
Iterative threader:
  ethread:          65043  (+3.06%)
  dom:              31170  (-16.7%)
  thread:           86717  (+127%)
  vrp:              33851  (-42.2%)
  Total threaded:  216781  (+9.90%)
This means that the new code not only gets 127% more cases, but it reduces the DOM and VRP opportunities considerably (16.7% and 42.2% respectively). The end result is that we have the possibility of getting almost 10% more jump threading opportunities in the entire compilation run.
(Note that the new code gets even more opportunities, but I'm only reporting the profitable ones that made it all the way through to the threader backend, and actually eliminated a branch.)

Thanks for clarifying that. It was one of the questions that first popped into my mind.

The overall compilation hit from this work is currently 1.38% as measured by callgrind. We should be able to reduce this a bit, plus we could get some of that back if we can replace the DOM and VRP threaders (future work).

Given how close we were to dropping the VRP threaders before, I would support dropping them at the same time. That gives you a bit more compile-time budget.

My proposed implementation should be able to get any threading opportunity, and will get more as range-ops and ranger improve.
I can go into the details if necessary, but the gist of it is that we leverage the import facility in the ranger to only look up paths that have a direct repercussion in the conditional being threaded, thus reducing the search space. This enhanced path discovery, plus an engine to resolve conditionals based on knowledge from a CFG path, is all that is needed to register new paths. There is no limit to how far back we look, though in practice, we stop looking once a path is too expensive to continue the search in a given direction.

Right. That's one of the great things about Ranger -- it can dramatically reduce the search space.

The solver API is simple:

// This class is a thread path solver.  Given a set of BBs indicating
// a path through the CFG, range_in_path() will return the range
// of an SSA as if the BBs in the path would have been executed in
// order.
//
// Note that the blocks are in reverse order, thus the exit block is
// path[0].
class thread_solver : gori_compute
{

public:
  thread_solver (gimple_ranger &ranger);
  virtual ~thread_solver ();
  void set_path (const vec<basic_block> *, const bitmap_head *imports);
  void range_in_path (irange &, tree name);
  void range_in_path (irange &, gimple *);
...
};
Basically, as we're discovering paths, we ask the solver what the value of the final conditional in a BB is in a given path. If it resolves, we register the path.

Exactly. Given a path, do we know enough to resolve the conditional at the end? If so, register the path as a potential jump threading opportunity.
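
To illustrate the idea of resolving a final conditional from what a path implies, here is a toy, self-contained model. It is not GCC code and does not use gori_compute, irange, or any real ranger API: toy_path_solver, Range, Cond, Block, Outcome, and resolve_exit are hypothetical names, ranges are plain closed integer intervals, and every conditional is assumed to have the form "name < bound" or "name >= bound". It borrows only the reverse-ordered path convention from the API comment above.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Closed integer interval [lo, hi]; the default is "any value".
struct Range
{
  long lo = LONG_MIN;
  long hi = LONG_MAX;
  bool empty () const { return lo > hi; }
  bool within (const Range &o) const { return lo >= o.lo && hi <= o.hi; }
};

enum class Op { LT, GE };
enum class Outcome { True, False, Unknown };

// A conditional of the form "name OP bound", e.g. x < 10.
struct Cond
{
  std::string name;
  Op op;
  long bound;
};

// Values of c.name for which the conditional evaluates to TAKEN.
// Assumes bound > LONG_MIN so that bound - 1 does not overflow.
Range range_for (const Cond &c, bool taken)
{
  bool less_than = (c.op == Op::LT) == taken;
  Range r;
  if (less_than)
    r.hi = c.bound - 1;
  else
    r.lo = c.bound;
  return r;
}

// A block on the path: its conditional and which way the path leaves it.
// TAKEN is ignored for the exit block, whose outcome is what we want.
struct Block
{
  Cond cond;
  bool taken;
};

class toy_path_solver
{
public:
  // As in the quoted API, PATH is in reverse order: path[0] is the exit.
  void set_path (const std::vector<Block> &path)
  {
    assert (!path.empty ());
    m_ranges.clear ();
    m_exit = path[0].cond;
    // Narrow each name by every edge the path takes before the exit.
    for (std::size_t i = path.size (); i-- > 1; )
      {
        Range edge = range_for (path[i].cond, path[i].taken);
        Range &r = m_ranges[path[i].cond.name];
        r.lo = std::max (r.lo, edge.lo);
        r.hi = std::min (r.hi, edge.hi);
      }
  }

  // Range of NAME on entry to the exit block along the current path.
  Range range_in_path (const std::string &name) const
  {
    auto it = m_ranges.find (name);
    return it == m_ranges.end () ? Range () : it->second;
  }

  // Outcome of the exit conditional, if the path determines it.
  Outcome resolve_exit () const
  {
    Range r = range_in_path (m_exit.name);
    if (r.empty ())
      return Outcome::Unknown;   // infeasible path: nothing to register
    if (r.within (range_for (m_exit, true)))
      return Outcome::True;
    if (r.within (range_for (m_exit, false)))
      return Outcome::False;
    return Outcome::Unknown;
  }

private:
  std::map<std::string, Range> m_ranges;
  Cond m_exit{};
};
```

For instance, with a path whose earlier block takes the true edge of x < 5 and whose exit block tests x < 10, resolve_exit () reports Outcome::True, so that path would be registered as a candidate. If the earlier edge only establishes x < 20, the exit conditional stays unresolved and nothing is registered.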

A follow-up project would be to analyze what DOM/VRP are actually getting that we don't, because in theory with an enhanced ranger, we should be able to get everything they do (minus some float stuff, and some CSE things DOM does). However, IMO, this is good enough to at least replace the current backwards threading code.

I bet it's going to be tougher to remove DOM's threader. It knows how to do things like resolve memory references using temporary equivalences and such. But I bet it's enough to drop the VRP-based threaders.

My suggestion would be to keep both implementations, defaulting to the ranger-based one, and running the old code immediately after -- trapping if it can find any threading opportunities. After a few weeks, we could kill the old code.

I can live with that :-)

p.s. BTW, ranger-based is technically a misnomer. It's gori based. We don't need the entire ranger caching ability here. I'm only using it to get the imports for the interesting conditionals, since those are static.

That's probably sufficient to capture the first-order effects. I can come up with theoretical cases that it'd likely miss (involving temporary equivalences which could affect the set of interesting conditionals), but I doubt they're important in practice.


jeff
