On Fri, Jun 1, 2018 at 10:38 PM Andrew MacLeod <amacl...@redhat.com> wrote:
>
> On 06/01/2018 05:48 AM, Richard Biener wrote:
> > On Wed, May 30, 2018 at 1:53 AM Andrew MacLeod <amacl...@redhat.com> wrote:
bah, gmail now completely mangles quoting :/

> > > This allows queries for a range on an edge, on entry to a block, as
> > > an operand on a specific statement, or to calculate the range of
> > > the result of a statement.  There are no prerequisites to use it;
> > > simply create a path ranger and start using the API.  There is even
> > > an available function which can be lightly called and doesn't
> > > require knowledge of the ranger:
> >
> > So what the wiki pages do not show is how contextual information
> > is handled (and how that's done w/o dominators as the wiki page
> > claims).  That is, path_range_on_edge suggests that range information
> > provided by (all) the controlling stmts of that edge is used when
> > determining the range for 'name'.
>
> The wiki doesn't talk about it, but the document itself goes into
> great detail about how the dynamic process works.
>
> > That's probably true for the 'stmt' variant as well.
> >
> > This leads me to guess that either the cache has to be
> > O(number of SSA uses) in size, or a simple VRP pass implemented
> > using Ranger, querying ranges of all SSA ops for each stmt,
> > will require O(number of SSA uses) times analysis?
>
> The current cache is not particularly efficient; its size is
> #ssa_names * #BB, assuming you actually go and look at every basic
> block, which many passes do not.  It also only puts a range in there
> when it has to.  I'm not sure why it really matters unless we find a
> pathological case that kills compile time which cannot be addressed.
> This would all fall under performance analysis and tweaking.  It's the
> end performance that matters.
>
> When a range request is made for NAME in a block, it requests a range
> on each incoming edge, then unions those ranges and caches the result.
> This is the range-on-entry cache.  The next time a range request for
> NAME is made on entry to that block, it is simply picked up from the
> cache.
>
> The requests for a range on an edge use the gori_map summary to decide
> whether that block creates a range for NAME or not.
> If the block does not generate a range, it simply requests the
> range-on-entry to that block.
> If it does generate a range, then it examines the statements defining
> the condition, gets ranges for those, and calculates the range that
> would be generated on that edge.  It also requests ranges for any of
> those names in the defining chain which originate outside this block.
>
> So it queries all the names which feed other names, and so on.  The
> queries end when a statement is found that doesn't generate a useful
> range.  Sometimes it has to go hunt quite a way, but usually they are
> nearby.  And caching the range-on-entry values prevents the lookups
> from doing the same work over and over.
>
> Once a range has been calculated, we don't spend much more time
> calculating it again.  Any other ssa-name which uses that name also
> doesn't need to recalculate it.  For the most part, we walk the def
> chains for any given ssa-name and its feeding names once, and future
> requests pretty much come from the cache.
>
> So yes, there is also a lot of recursion in here at the moment: calls
> for ranges spawning other calls before they can be resolved.  I
> suspect sometimes this call chain is deep.  I don't know, we haven't
> performed any analysis on it yet.
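> In rough pseudo-code, the on-entry lookup behaves something like this
> (illustrative names only, not the actual interfaces on the branch):
>
>   irange
>   range_on_entry (tree name, basic_block bb)
>   {
>     irange r;
>     if (cache_lookup (name, bb, r))    // computed by an earlier query
>       return r;
>     r.set_undefined ();
>     edge e;
>     edge_iterator ei;
>     FOR_EACH_EDGE (e, ei, bb->preds)   // union all incoming edges
>       r.union_ (range_on_edge (name, e));
>     cache_set (name, bb, r);           // the range-on-entry cache
>     return r;
>   }
>
> where range_on_edge () consults the gori_map summary and either
> evaluates the block's condition or falls back to the source block's
> range-on-entry.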
> > One of the advantages of a DOM-style or SSA-with-ASSERT_EXPR
> > pass is that the number of evaluations and the size of the cache
> > isn't that way "unbound".  On the WIKI page you mention edge
> > info isn't cached yet - whatever that means ;)
> >
> > So I wonder what's the speed for doing
>
> The on-entry caches prevent it from being "unbound", and the unioning
> of incoming edge calculations performs the equivalent merging of
> ranges that DOM-based processing gives you.
>
>   if (x > 10)
>     A
>   else
>     B
>   C
>
> Block A has x [11, MAX].
> Block B has x [MIN, 10].
> In block C, at the bottom of the diamond, the 2 incoming edges union
> those 2 ranges back to [MIN, MAX] and we know nothing about the range
> of x again.  That accomplishes the same thing dominators would tell
> you.
>
> The edge info that "isn't cached yet" (and may not ever be) is simply
> the 'static' info calculated by evaluating the few statements at the
> bottom of the block feeding the condition.  Any values coming into the
> block are cached.  In this case it's simply looking at x > 10 and
> calculating the range on whichever edge.  One of the original design
> elements called for this to be cached, but in practice I don't think
> it needs to be.  But we haven't analyzed the performance yet.. so it
> can only get quicker :-)
>
> >   FOR_EACH_BB_FN (bb)
> >     FOR_EACH_STMT (stmt)
> >       FOR_EACH_SSA_USE (stmt)
> >         path_range_on_stmt (..., SSA-USE, stmt);
> >
> > assuming that it takes into account controlling predicates.
> > That's actually what EVRP does (within a DOM walk) given
> > it tries to simplify each stmt based on range info of the ops.
>
>   FOR_EACH_BB_FN (bb)
>     FOR_EACH_STMT (stmt)
>       path_range_on_stmt (..., stmt);
>
> would be sufficient.  In order to calculate the range for the
> definition of the statement, the Ranger will query the range of each
> SSA operand.
>
> I had Aldy run his cumulative .ii times with a patch added to EVRP
> which adds a ranger and makes a call to path_range_stmt on each PHI
> and every stmt as evrp processes it:
>
>   evrp original time:  2025055 usec
>   evrp + ranger:       3787678 usec
>                        -------
>   ranger time:         1762623 usec
>
> So to be fair, we aren't actually doing anything and evrp is, but the
> actual folding of statements is not a big time killer, and that is the
> only additional thing EVRP is doing.  This indicates that for a
> full-on stmt-by-stmt walk and process, we're certainly in the same
> time ballpark.
>
> A significant point is that the stmt-by-stmt walk is also not the
> intended target.  That's old school now :-)  Many clients only care
> about a few ranges.
>
> The "quick and dirty" RVRP pass only looks at the last condition in
> the block and catches the vast majority of what the stmt-by-stmt walk
> accomplishes.  If timings for the RVRP pass are any indication, I
> would say it walks/calculates 60% of the statements in the program to
> calculate flow conditions.  Processing a lot of arithmetic statements
> that can never result in a fold seems kind of pointless; leave those
> to constant prop (which could use the ranger as a client).  The extra
> stmt-by-stmt opportunities found mostly appear to be PHIs that can be
> folded and the odd condition that doesn't feed a branch.
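> In pseudo-code, that quick and dirty pass amounts to something like
> this (illustrative only -- none of these names are the branch's code):
>
>   basic_block bb;
>   irange r;
>   FOR_EACH_BB_FN (bb, cfun)
>     {
>       gimple *last = last_stmt (bb);     // the block's condition
>       if (last && gimple_code (last) == GIMPLE_COND
>           && ranger.path_range_stmt (r, last)
>           && r.singleton_p ())           // condition folds to a constant
>         fold_branch (last, r);           // hypothetical folder
>     }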
> EVRP also *has* to walk each statement because that is the model it
> uses to find ranges: build them from the top down.  The ranger model
> does not have this need.  If uses of the ranger were more prolific in
> other passes, I think a lot of the things the existing VRP passes get
> on these statements would be caught in the normal processing the other
> passes do.  We did put a stmt-by-stmt RVRP on the todo list, mostly so
> we could get a better apples-to-apples time comparison and to make
> sure we would eventually have full coverage of VRP's abilities.
>
> I would also note again that we have made no attempt to optimize
> things yet.  We don't really know how much extra work we're doing that
> doesn't need to be redone.  We just know the current speed is a pretty
> good starting point, and we're seeing big improvements in the passes
> that have been changed to use it.  We will get to finding the
> inefficiencies and try to address them.
>
> > More generally my remarks are still the same as with the irange
> > introduction attempt.
> >
> > It's a no-no to add yet another set of range op primitives given we
> > already have them implemented (with trees, yes...) in tree-vrp.c.
> > The existing ones are already factored out to some extent, so I
> > expect you will continue that and split out the wide-int cases from
> > those existing routines.  Yes - that will not be very C++-ish - but
> > who cares.  Simply implement C-style workers; for example, be
> > inspired by the tree-ssa-ccp.c:bit_value_* routines implementing
> > non-zero-bits operation primitives on top of wide_ints (and also
> > wrapping those with tree variants).
>
> Sharing of code to do the same work was covered as something we will
> get to.  Aldy is working on that, so there is no dual maintenance.  I
> had started it with the wide-int-aux.[ch] file in the branch but never
> got very far.  Yeah, it's not super pretty, but it is what it is.
>
> It won't be perfect, however, as many of those routines are written
> with the assumption that ranges are either a single range or an
> anti-range.  We no longer have that restriction, and we will diverge
> in those cases.

But having 1 single range or N ranges isn't different when it comes to
the core working routines, because if you have N ranges it's simply a
matter of applying the op N times and unioning the results.  Yes - the
special case is now that OP (range) can result in multiple ranges, but
currently all the interesting cases you likely implement are covered by
returning either a range or an anti-range.  Because more generally
you'd not want N ranges but some affine thing so you can handle even
more cases without blowing up storage.
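Concretely, the core worker would just loop (a sketch with made-up
names, not anything that exists today):

  irange
  fold_range_op (enum tree_code code, tree type, const irange &r)
  {
    irange result;                       // starts empty
    // Apply the operation to each subrange and union the results --
    // the same worker as the single range / anti-range case, run N
    // times.
    for (unsigned i = 0; i < r.num_pairs (); ++i)
      result.union_ (fold_subrange (code, type,
                                    r.lower_bound (i),
                                    r.upper_bound (i)));
    return result;
  }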
> Extricating the code from the symbolic handling is also not
> straightforward sometimes, as there are big switches which mix and
> match various bits of each case.

Yes.  Which is why I never really got far with dumping the symbolic
stuff, because it's hard to come up with a replacement.  I guess in
reality you want to model symbolic ranges as constraints on affine
expressions and then have some constraint solver to answer queries --
at least for symbolic ranges the interesting transforms involve
comparison simplifications (aka jump threading) only.

> We will however try to distill it down to the wide-int aspects that
> are common.
>
> I would also argue that long term I see value_range disappearing
> completely... and then there won't be any dual maintenance :-)
>
> > For most of your on-demand uses I would expect a _much_ simpler and
> > less heavy-weight approach to be extending the recently merged
> > determine_value_range (I've pointed you to this multiple times in
> > the past) to walk SSA def-use chains using the existing SSA
> > on-the-side info as a cache (yes, w/o use-granular contextual
> > information).
>
> I think what we have is pretty lightweight and more powerful than that
> would be.  I started from scratch so we could get all the ideas in one
> place working together and not suffer from limits imposed by existing
> assumptions or information not easily available.  I think our results
> confirm those benefits.  It's fast, easy to use, and it's already
> starting to get things EVRP does not get.  It is also showing signs of
> finding things VRP itself has trouble with.  When we handle
> equivalencies, I would expect we will exceed current capabilities.
>
> > I also think that purely restricting yourself to wide-ints is a
> > mistake - we do very well want to have range information for
> > REAL_CSTs (or composite integers / reals).  How do you think of
> > extending Ranger to handle different kinds of types?  Eric already
> > raised the issue of symbolic ranges.
>
> This is in fact one of the main reasons we wrote everything new,
> including a distinct range class.  We expected to want to make this
> work on other types.
>
> irange and range-ops are the classes that handle integer ranges.  We
> have a very well defined API for ranges now, and the range engine
> itself simply uses this API.
>
> If we write a class 'real_range' to manipulate & store real ranges and
> implement 'real_range_ops' to teach the compiler how real ranges are
> extracted from expressions, the entire ranger engine will work on
> reals.
>
> There are 2 options for proceeding with this:
>
> a) We'd either template the ranger to use <class Range> instead of
> class irange, or at least extract out the common code and template
> that.  Then we'd have the same on-demand ranger for "real" ranges, or
> "complex" or whatever.  So instead of invoking it with 'path_ranger r',
> it'd be something like 'path_ranger<irange> r' or
> 'path_ranger<real_range> r'.
>
> I like that less because I hate templates, but it is an option.  The
> more likely case I planned for is:
>
> b) Adjust class irange to derive from a base range class with the API
> as virtual functions, and inherit irange and real_range from that for
> the ranger to call.  Then it would handle both simultaneously.  We'd
> have some tweaking to do where the ranger currently checks that the
> type of things is integral, but nothing too serious.  The entire
> structure of the range engine would still apply, and we could process
> both simultaneously.
>
> The second model was also intended from the beginning to allow us to
> adjust the number of sub-ranges dynamically.  Right now we set it to 3
> sub-ranges at compile time, but the intention is to allow this to vary
> if we want.  There are cases where some optimization might be willing
> to spend extra time/memory in order to get really accurate range info.
> Maybe switch analysis could use an irange which allowed up to 1000
> subranges so it could very precisely map what the value is at any
> given point.  The general case doesn't need that extra memory usage,
> however.
>
> It just gives us flexibility we don't currently have.  We also expect
> to do some tests and find out what the "optimum" default number of
> sub-ranges is.  I chose 3 because it was handling cases I wanted that
> we couldn't get before.  Maybe a different number is better; we'd have
> to run some experiments to see how often we could use more ranges.
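> A sketch of what option b) might look like (nothing like this exists
> yet; the names here are invented):
>
>   class range_base       // the virtual API the range engine calls
>   {
>   public:
>     virtual void union_ (const range_base &r) = 0;
>     virtual void intersect (const range_base &r) = 0;
>     virtual bool varying_p () const = 0;
>     virtual bool undefined_p () const = 0;
>   };
>
>   class irange : public range_base     { /* wide_int bounds */ };
>   class real_range : public range_base { /* REAL_VALUE_TYPE bounds */ };
>
> The engine would only ever manipulate ranges through the base API, so
> it would never need to know which kind of range it is processing.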
> > Yes, I do like having wide-int workers for ranges.
> >
> > Yes, I do like the idea of making ranges not restricted to single
> > sub-ranges (didn't review the implementation).
> >
> > But those things should have been done on the existing code, not
> > sticking sth completely new into GCC that cannot subsume the
> > existing stuff.  That's the road to maintenance hell which we
> > unfortunately walk way too often with GCC :/
>
> I argue that this will subsume all the existing code related to it.
> Sure, we can't do that right this moment; it's not a small task.  I
> think our pathway out of maintenance hell involves removing the
> existing value_range code and the various incarnations of VRP we have,
> and using something modern designed from scratch to solve these
> problems.  I would also predict we could do that in the release after
> this next one.  So we'd have one release with the ranger and the VRP
> brothers both present, and after that it would be just the Ranger.
>
> > Btw, the original goal of EVRP was to get rid of the
> > ASSERT_EXPR-style VRP pass once the threading bits it has are
> > subsumed by the backwards threader.  Does ranger at least allow
> > that - getting rid of the threading inside VRP without regressions?
>
> The ranger calculates ranges.  It does not implement a complete VRP
> replacement; we will implement a full VRP in time.  One of the
> original goals was to get this working so we can remove all the
> micro-optimizations that VRP does because range info isn't available
> anywhere else.  The clean goal is that all the threading-related range
> work should be done in threading via calls to the ranger.
>
> So to more directly answer the question: I do not know the state
> today.  I'm not the threader dude :-)  Aldy enhanced the backward
> threader to do some things we couldn't do before, and converted those
> other 3 passes.  I think Jeff has on his plate to remove the threading
> parts from VRP and put them back in the threader, so he can clarify
> this.  By the end of the stage I would expect that threader work to be
> complete and out of VRP.
>
> > Just as a final note - open source development doesn't work like
> > this, citing you: "I'd like to introduce a project we've been
> > working on for the past year and a half." - this simply provokes
> > "reviews" like the above.
>
> Well, it is what it is.  I see nothing wrong with the "review"; quite
> frankly, it's pretty much as expected.  I think it would have been
> almost identical if I'd talked about it last year.
>
> And why can't it work like this?  We aren't suggesting we shove this
> into trunk right now.  We've been experimenting with the technology,
> and now we think it's developed to the point where it can be of
> benefit, and we can demonstrate it works.  It's on a branch, and if
> anyone else wants to help, awesome.  We have a todo list.  It'll
> probably be months before we're even ready to be considered for a
> merge.  It has now advanced to the point where we have tangible things
> to talk about for the first time.
>
> Sure, we could replace irange with value_range and range-ops with some
> serious reconfiguration of the range extraction code, and the ranger
> would still work; it just needs the same range API.  I happen to think
> irange and range-ops offer benefits that value_range can't, so doing
> that would be a step backwards.
>
> I chose to create irange so we had the flexibility to enhance its
> abilities in the future while offering some improvements now.
> Range-ops was a desire to unite the range processing code in one
> place, and I needed to add the ability to do algebraic rearrangements
> that we currently don't do in VRP... the op1_irange and op2_irange
> code that derives ranges for operands given a LHS and the other
> operand.  Yeah, it's not in one place yet, but it will be.
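> As an example of running an operation backwards (made-up ranges):
>
>   a_2 = b_1 + 5;
>   if (a_2 > 10)     // on the true edge a_2 is [11, MAX]
>     ...             // op1_irange solves b_1 = a_2 - 5,
>                     // giving b_1 = [6, MAX - 5]
>
> i.e. knowing the range of the LHS and one operand, the PLUS_EXPR entry
> can derive a range for the other operand.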
> So at what point do we decide community involvement is appropriate?
> As soon as I get the idea?  Maybe after I had the third prototype
> sort-of working, indicating the approach was feasible?  It was pretty
> ugly then; I wouldn't want anyone to have seen that :-)  All the exact
> same questions would have come up then as now: we need to handle
> symbolics, it's too expensive, reuse the existing range code.  Rather
> than waving my hands to answer and not really accomplishing anything,
> I can now point to tangible things.  (mostly :-)  I would have still
> pushed for irange and range-ops because I think they are the right
> strategic solution.  Maybe someone else could have helped advance
> things a little faster, but I was a pretty big bottleneck on design
> and reworking things.  Ask Aldy.
>
> The only thing that we ever had working that was even worth
> considering submitting was the irange work last July, and that did not
> get very far.  We could also have waited another year until we have a
> complete VRP replacement working.  Then it's a drop-in solution that
> removes a lot of the arguments about code sharing and not re-using
> value_range.  That might have been even easier!  I do think it's
> reached the point where there is a lot of benefit to be derived now,
> and so here it is; let's figure out what we can do with it.
>
> Anyway, the "lateness" of the introduction is purely on me.  I like to
> get things working before talking about them, especially when they
> start as half-baked ideas.

I would have liked to see factoring out from the current code first,
and actually enhancing the current value_range representation by
removing the restriction on single ranges.  I don't very much believe
in the "start from scratch" approach, because history shows we always
end up with both, the old and the new, afterwards.

Richard.

> I think this is a good strategic direction to go: it demonstrates
> benefit now, and has a path to a better/easier-to-maintain VRP with a
> compiler-wide range resource.
>
> Andrew