On 6/4/19 1:07 PM, Richard Biener wrote:
On June 4, 2019 6:50:07 PM GMT+02:00, Andrew MacLeod <amacl...@redhat.com>
wrote:
On 6/4/19 11:37 AM, Richard Biener wrote:
where the single(!) query of the range of i on the alloca call will populate N BBs' caches, with the Nth having ranges for all SSA defs of i, running into quadraticness in N.
Well, not really. It's still linear, since the 'i' used in each foo() will be dead, so the next block will contain no range for it:
  try {
    long i_2 = _1;
    i_3 += foo (i_2);
    i_4 += foo (i_3);
    i_5 += foo (i_4);
    i_6 += foo (i_5);
    ... repeat N times ...
    alloca (i_6);   // query range of i
Once i_2 is 'dead' after the first call to foo(), it will no longer be in the on-entry cache of any other block. The walk back never sees i_2 until the first call to foo()...
It's actually

  utemp_2 = foo (i_1);
BB:
  i_3 = i_1 + (long) utemp_2;
  utemp_4 = foo (i_3);

eliding the separate stmt for the conversion. From your description of the cache it sounded like getting incoming values is a take-all operation. Otherwise you'd need negative entries as well (I thought the cache might not contain varying ranges, for example).
OK, utemp_2 and i_1 will be live, so there are 2 ranges on entry. The next block will have only utemp_4 and i_3 live on entry. So it'll be 2 ranges per block.

The cache does currently contain varying, but the varying value is cached once for the entire ssa_name, so 500 BBs with a varying range will only consume a single range record.
By negative entries, I assume you mean something like the bottom of a diamond?

  if (x_3 > 20)
    blah1 (x_3);
  else
    blah2 (x_3);
  blah3 (x_3);
It does nothing special to come up with the range of x_3 at blah3(). It simply unions all the incoming ranges, which is [21, MAX] union [MIN, 20], to produce the value [MIN, MAX], which is then put into the cache. We don't use anything special to represent varying; varying_p() merely checks whether the range is [MIN, MAX]. And when putting a value into the cache, if it is varying_p(), the cache simply uses its unique copy of [MIN, MAX] rather than creating a new one every time it is needed.
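To make that concrete, here is a toy C++ model of the scheme (names like cache::set and varying_record are invented for illustration; this is not the actual irange or cache API): the join produces [MIN, MAX], varying_p() detects it, and every block's entry then points at one shared record.

  #include <algorithm>
  #include <climits>
  #include <cstdio>
  #include <map>
  #include <memory>
  #include <utility>

  struct range
  {
    long lo, hi;
    // Varying is not a special state, merely the full span [MIN, MAX].
    bool varying_p () const { return lo == LONG_MIN && hi == LONG_MAX; }
  };

  class cache
  {
    // One on-entry record per (bb, ssa-name) pair...
    std::map<std::pair<int, int>, std::shared_ptr<range>> on_entry;
    // ...except varying, which is shared across all blocks per ssa-name.
    std::map<int, std::shared_ptr<range>> varying_record;

  public:
    void set (int bb, int ssa, const range &r)
    {
      if (r.varying_p ())
        {
          auto &v = varying_record[ssa];
          if (!v)
            v = std::make_shared<range> (r);
          on_entry[{bb, ssa}] = v;   // 500 BBs -> still one record
        }
      else
        on_entry[{bb, ssa}] = std::make_shared<range> (r);
    }
  };

  int main ()
  {
    // The diamond above: [21, MAX] union [MIN, 20] == [MIN, MAX].
    range t = { 21, LONG_MAX };
    range f = { LONG_MIN, 20 };
    range joined = { std::min (t.lo, f.lo), std::max (t.hi, f.hi) };
    std::printf ("varying_p () == %d\n", joined.varying_p ());

    cache c;
    c.set (/*bb=*/7, /*ssa=*/3, joined);   // stores the shared copy
  }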
I had originally considered not caching ranges spanning blocks in which the range never changes... The trade-off is an increase in calculation time as you walk the CFG to find the last block in which the range did change. I figured we'd first see whether the less error-prone full cache causes any real problems or not. It's always an option to implement later, since the cache is self-contained and can resolve its queries internally however it wants.
...so none of the blocks after that have a need for its range, and it will not be in their caches.

The on-entry cache will only get a single range entry for each of those blocks: the incoming value of i_x, and that's it. The implicitly known live-on-exit attribute is handy here.
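As a sketch of why this stays linear, here is a minimal standalone C++ model (the block numbers and live-on-entry sets are made up; nothing here is ranger code): each block contributes records only for its live names, so the total grows with the block count rather than its square.

  #include <cstdio>
  #include <map>
  #include <set>
  #include <string>

  int main ()
  {
    // Live-on-entry sets for the chain of blocks in the example: each
    // block sees only the current i_N and the utemp from the preceding
    // call, never the dead earlier names.
    std::map<int, std::set<std::string>> live_on_entry = {
      { 2, { "i_1", "utemp_2" } },
      { 3, { "i_3", "utemp_4" } },
      { 4, { "i_5", "utemp_6" } },
      // ... two names per block, no matter how long the chain is ...
    };

    std::size_t total = 0;
    for (const auto &bb : live_on_entry)
      {
        total += bb.second.size ();   // one cache record per live name
        std::printf ("BB%d: %zu on-entry records\n",
                     bb.first, bb.second.size ());
      }
    std::printf ("total records: %zu (linear in the block count)\n", total);
  }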
Now, that's not to say we can't construct cases where lots of ranges get populated. If we find the cache is really a problem, we can start sharing the range records themselves, so that even if all these i_x's were somehow live, there would only be one range record for each in this case, and the caches would simply all point to the same range.
I suppose using i in the catch block would create something like this, a PHI with a large in-degree, and thus the switch case you already know about?
It would be a PHI. None of the ranges coming into the block are considered live on entry, so they don't get into the cache. The PHI calculation asks for the incoming range on the edge for each argument, we get a global range calculated for the PHI definition, and that is it.
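A hedged sketch of that PHI evaluation, with invented helpers (range_on_edge here just returns canned values; in reality the edge ranges come from the solver): the result is simply the union of each argument's range on its incoming edge.

  #include <algorithm>
  #include <cstdio>

  struct range { long lo, hi; };

  static range
  range_union (range a, range b)
  {
    return { std::min (a.lo, b.lo), std::max (a.hi, b.hi) };
  }

  // Stand-in for "range of the PHI argument on its incoming edge".
  static range
  range_on_edge (int arg)
  {
    static const range edge_range[] = { { 0, 9 }, { 20, 20 }, { 5, 30 } };
    return edge_range[arg];
  }

  int main ()
  {
    // i_7 = PHI <i_2(e0), i_4(e1), i_5(e2)>
    range result = range_on_edge (0);
    for (int arg = 1; arg < 3; arg++)
      result = range_union (result, range_on_edge (arg));
    std::printf ("i_7 : [%ld, %ld]\n", result.lo, result.hi);  // [0, 30]
  }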
However, that's not actually the switch problem :-) The switch problem isn't due to a large number of incoming edges, but rather the difficulty of calculating case ranges on the outgoing edges of the switch itself.

Stepping back slightly: branches are handled generically the same way any expression is; the outgoing edge you want information about provides a constant range for the implicit LHS of the branch.
So for an if:

  c_2 = x_3 > 20
  if (c_2 != 0)

the TRUE edge has a range of bool [1,1], fed back as the implicit LHS of the branch expression:

  [1,1] = (c_2 != 0), from which range-ops for '!=' concludes c_2 must be [1,1] to satisfy the equation, and then
  [1,1] = (x_3 > 20), from which range-ops for '>' concludes x_3 must have a range of [21, MAX].

If you want the range on the FALSE edge, the edge feeds the range [0,0] into the equation solver, and out pops [MIN, 20].
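Here is a toy version of that back-solving step (solve_gt is a stand-in for what range-ops' '>' handler computes; it is not GCC's actual class): feed in the known LHS range for the edge and recover the operand range.

  #include <climits>
  #include <cstdio>

  struct range { long lo, hi; };

  // Back-solve 'x > c' from the known range of the comparison result:
  // a TRUE result forces x into [c+1, MAX], a FALSE one into [MIN, c].
  static range
  solve_gt (bool lhs_true, long c)
  {
    if (lhs_true)
      return { c + 1, LONG_MAX };
    return { LONG_MIN, c };
  }

  int main ()
  {
    // c_2 = x_3 > 20; if (c_2 != 0)
    // TRUE edge: [1,1] = (c_2 != 0) => c_2 is [1,1] => x_3 is [21, MAX].
    range t = solve_gt (true, 20);
    // FALSE edge: feed [0,0] instead, and out pops [MIN, 20].
    range f = solve_gt (false, 20);
    std::printf ("TRUE  edge: x_3 in [%ld, %ld]\n", t.lo, t.hi);
    std::printf ("FALSE edge: x_3 in [%ld, %ld]\n", f.lo, f.hi);
  }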
For a switch, say

  switch (i_3)

it works the same way, except the edge has an integral value rather than a boolean one. This range is fed back into the implicit branch expression:

  [edge range] = i_3

so i_3 has whatever value the edge has, since the branch amounts to a copy. Calculating this edge constant value is the problem.
BB2:
  switch (i_3) {

BB3:
  case 4:
  case 5:
  case 9:
  case 10:
    foo (i_3);
In order to evaluate the case values on the edge from BB2 to BB3 as [4,5][9,10], we have to loop over all the cases that have a label at the beginning of the block and union them together. Calculating the default value is even worse: start with varying and intersect out each and every case.

The gimple representation does not make evaluating this easy or cheap, since it merely has a list of case LOW/HIGH pairs with labels... so you have to loop through that entire list for each edge being evaluated, to see whether the label is associated with this edge, and then union together all those that match. That scan is linear and can't be shortcut. Multiply it by the number of edges, since you probably want a range on each edge, and it's suddenly quadratic. The result is that it can be *very* time consuming if the switch is very, very large.
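To illustrate the cost, here is a standalone model (case_entry and the coalescing loop are invented; gimple's real representation is the label list described above): answering one edge's query means scanning the whole case list, and every additional edge queried repeats that scan.

  #include <cstdio>
  #include <vector>

  struct case_entry { long low, high; int edge; };
  struct range { long lo, hi; };

  int main ()
  {
    // The switch above: cases 4, 5, 9 and 10 all branch to BB3.
    std::vector<case_entry> cases
      = { { 4, 4, 3 }, { 5, 5, 3 }, { 9, 9, 3 }, { 10, 10, 3 },
          { 42, 50, 4 } };   // a case bound for some other edge

    int query_edge = 3;
    std::vector<range> result;        // will become [4,5][9,10]
    for (const auto &c : cases)       // full scan; no shortcut exists
      if (c.edge == query_edge)
        {
          if (!result.empty () && result.back ().hi + 1 == c.low)
            result.back ().hi = c.high;   // coalesce adjacent cases
          else
            result.push_back ({ c.low, c.high });
        }

    for (const auto &r : result)
      std::printf ("[%ld, %ld]", r.lo, r.hi);
    std::printf ("\n");
  }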
And it's a constant value that never changes, so it really could be cheap. We've just never needed to ask this exact question before. We've considered multiple ways to address this, but we won't worry about any of them just yet. All in time :-)
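Purely as an illustration of why it could be cheap (this is one hypothetical shape, not any of the approaches actually under consideration), a single linear pass can bucket the LOW/HIGH pairs by edge once, and the resulting map never changes, so it can be cached:

  #include <map>
  #include <vector>

  struct range { long lo, hi; };
  struct case_entry { long low, high; int edge; };

  // One linear pass over the case list builds every edge's range list
  // at once; the result can then be cached, since it never changes.
  std::map<int, std::vector<range>>
  precompute_edge_ranges (const std::vector<case_entry> &cases)
  {
    std::map<int, std::vector<range>> per_edge;
    for (const auto &c : cases)
      per_edge[c.edge].push_back ({ c.low, c.high });
    return per_edge;
  }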
So far we haven't really run across anything that appears to trigger significant concerns.
Sure. It's still worth thinking about worst-case behavior and how you'd deal with it, given that once ranger is actually used we will eventually run into these cases. If a forward EVRP walk is then faster than a single on-demand query, that would be concerning.
Richard.
Absolutely. I've been concerned about the worst case since the get-go. That's how it evolved to this design. We're trying to make sure the worst-case scenario is no worse than doing a normal walk.

We can currently run the ranger version of RVRP across blocks in dominator order, reverse dominator order, linearly BB1 -> BBN, and finally BBN -> BB1. There are fluctuations in timings, but nothing of much significance. They all run in similar times and pass all the tests.
I have thoughts on how the various pieces can fit together with EVRP; I'll stick that in another email.
Andrew