On Fri, Jun 1, 2018 at 10:38 PM Andrew MacLeod <amacl...@redhat.com> wrote:
>
> On 06/01/2018 05:48 AM, Richard Biener wrote:
> > On Wed, May 30, 2018 at 1:53 AM Andrew MacLeod <amacl...@redhat.com> wrote:
bah, gmail now completely mangles quoting :/

> > > This allows queries for a range on an edge, on entry to a block, as
> > > an operand on a specific statement, or to calculate the range of
> > > the result of a statement.  There are no prerequisites to use it;
> > > simply create a path ranger and start using the API.  There is even
> > > an available function which can be lightly called and doesn't
> > > require knowledge of the ranger:
> >
> > So what the wiki pages do not show is how contextual information
> > is handled (and how that's done w/o dominators as the wiki page
> > claims).  That is, path_range_on_edge suggests that range information
> > provided by (all) the controlling stmts of that edge is used when
> > determining the range for 'name'.
>
> The wiki doesn't talk about it, but the document itself goes into
> great detail about how the dynamic process works.
>
> > That's probably true for the 'stmt' variant as well.
> >
> > This leads me to guess that either the cache has to be
> > O(number of SSA uses) in size, or a simple VRP pass implemented
> > using Ranger, querying ranges of all SSA ops for each stmt,
> > will require O(number of SSA uses) times analysis?
>
> The current cache is not particularly efficient; its size is
> #ssa_names * #BB, assuming you actually go and look at every basic
> block, which many passes do not.  It also only puts a range in there
> when it has to.  I'm not sure why it really matters unless we find a
> pathological case that kills compile time which cannot be addressed.
> This would all fall under performance analysis and tweaking.  It's the
> end performance that matters.
>
> When a range request is made for NAME in a block, it requests a range
> on each incoming edge, then unions those ranges and caches the result.
> This is the range-on-entry cache.  The next time a range request for
> NAME is made on entry to that block, it is simply picked up from the
> cache.
>
> The requests for a range on an edge use the gori_map summary to decide
> whether that block creates a range for NAME or not.
> If the block does not generate a range, it simply requests the
> range-on-entry to that block.
> If it does generate a range, then it examines the statements defining
> the condition, gets ranges for those, and calculates the range that
> would be generated on that edge.  It also requests ranges for any of
> those names in the defining chain which originate outside this block.
>
> So it queries all the names which feed other names, and so on.  The
> queries end when a statement is found that doesn't generate a useful
> range.  Sometimes it has to go hunt quite a way, but usually they are
> nearby.  And caching the range-on-entry values prevents the lookups
> from doing the same work over and over.
>
> Once a range has been calculated, we don't spend much more time
> calculating it again.  Any other ssa-name which uses that name also
> doesn't need to recalculate it.  For the most part, we walk the def
> chains for any given ssa-name and its feeding names once, and future
> requests pretty much come from the cache.
>
> So yes, there is also a lot of recursion in here at the moment: calls
> for ranges spawning other calls before they can be resolved.  I
> suspect sometimes this call chain is deep.  I don't know, we haven't
> performed any analysis on it yet.
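> In rough pseudo-code, the on-entry lookup behaves something like this
> (illustrative names only, not the actual interfaces on the branch):
>
>   irange
>   range_on_entry (tree name, basic_block bb)
>   {
>     irange r;
>     if (cache_lookup (name, bb, r))    // computed by an earlier query
>       return r;
>     r.set_undefined ();
>     edge e;
>     edge_iterator ei;
>     FOR_EACH_EDGE (e, ei, bb->preds)   // union all incoming edges
>       r.union_ (range_on_edge (name, e));
>     cache_set (name, bb, r);           // the range-on-entry cache
>     return r;
>   }
>
> where range_on_edge () consults the gori_map summary and either
> evaluates the block's condition or falls back to the source block's
> range-on-entry.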
> > One of the advantages of a DOM-style or SSA-with-ASSERT_EXPR
> > pass is that the number of evaluations and the size of the cache
> > isn't that way "unbound".  On the WIKI page you mention edge
> > info isn't cached yet - whatever that means ;)
> >
> > So I wonder what's the speed for doing
>
> The on-entry caches prevent it from being "unbound", and the unioning
> of incoming edge calculations performs the equivalent merging of
> ranges that DOM-based processing gives you.
>
>   if (x > 10)
>     A
>   else
>     B
>   C
>
> Block A has x [11, MAX].
> Block B has x [MIN, 10].
> In block C, at the bottom of the diamond, the 2 incoming edges union
> those 2 ranges back to [MIN, MAX] and we know nothing about the range
> of x again.  That accomplishes the same thing dominators would tell
> you.
>
> The edge info that "isn't cached yet" (and may not ever be) is simply
> the 'static' info calculated by evaluating the few statements at the
> bottom of the block feeding the condition.  Any values coming into the
> block are cached.  In this case it's simply looking at x > 10 and
> calculating the range on whichever edge.  One of the original design
> elements called for this to be cached, but in practice I don't think
> it needs to be.  But we haven't analyzed the performance yet.. so it
> can only get quicker :-)
>
> >   FOR_EACH_BB_FN (bb)
> >     FOR_EACH_STMT (stmt)
> >       FOR_EACH_SSA_USE (stmt)
> >         path_range_on_stmt (..., SSA-USE, stmt);
> >
> > assuming that it takes into account controlling predicates.
> > That's actually what EVRP does (within a DOM walk) given
> > it tries to simplify each stmt based on range info of the ops.
>
>   FOR_EACH_BB_FN (bb)
>     FOR_EACH_STMT (stmt)
>       path_range_on_stmt (..., stmt);
>
> would be sufficient.  In order to calculate the range for the
> definition of the statement, the Ranger will query the range of each
> SSA operand.
>
> I had Aldy run his cumulative .ii times with a patch added to EVRP
> which adds a ranger and makes a call to path_range_stmt on each PHI
> and every stmt as evrp processes it:
>
>   evrp original time:  2025055 usec
>   evrp + ranger:       3787678 usec
>                        -------
>   ranger time:         1762623 usec
>
> So to be fair, we aren't actually doing anything and evrp is, but the
> actual folding of statements is not a big time killer, and that is the
> only additional thing EVRP is doing.  This indicates that for a
> full-on stmt-by-stmt walk and process, we're certainly in the same
> time ballpark.
>
> A significant point is that the stmt-by-stmt walk is also not the
> intended target.  That's old school now :-)  Many clients only care
> about a few ranges.
>
> The "quick and dirty" RVRP pass only looks at the last condition in
> the block and catches the vast majority of what the stmt-by-stmt walk
> accomplishes.  If timings for the RVRP pass are any indication, I
> would say it walks/calculates 60% of the statements in the program to
> calculate flow conditions.  Processing a lot of arithmetic statements
> that can never result in a fold seems kind of pointless; leave those
> to constant prop (which could use the ranger as a client).  The extra
> stmt-by-stmt opportunities found mostly appear to be PHIs that can be
> folded and the odd condition that doesn't feed a branch.
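> In pseudo-code, that quick and dirty pass amounts to something like
> this (illustrative only -- none of these names are the branch's code):
>
>   basic_block bb;
>   irange r;
>   FOR_EACH_BB_FN (bb, cfun)
>     {
>       gimple *last = last_stmt (bb);     // the block's condition
>       if (last && gimple_code (last) == GIMPLE_COND
>           && ranger.path_range_stmt (r, last)
>           && r.singleton_p ())           // condition folds to a constant
>         fold_branch (last, r);           // hypothetical folder
>     }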
> EVRP also *has* to walk each statement because that is the model it
> uses to find ranges: build them from the top down.  The ranger model
> does not have this need.  If uses of the ranger were more prolific in
> other passes, I think a lot of the things the existing VRP passes get
> on these statements would be caught in the normal processing the other
> passes do.  We did put a stmt-by-stmt RVRP on the todo list, mostly so
> we could get a better apples-to-apples time comparison and to make
> sure we would eventually have full coverage of VRP's abilities.
>
> I would also note again that we have made no attempt to optimize
> things yet.  We don't really know how much extra work we're doing that
> doesn't need to be redone.  We just know the current speed is a pretty
> good starting point, and we're seeing big improvements in the passes
> that have been changed to use it.  We will get to finding the
> inefficiencies and try to address them.
>
> > More generally my remarks are still the same as with the irange
> > introduction attempt.
> >
> > It's a no-no to add yet another set of range op primitives given we
> > already have them implemented (with trees, yes...) in tree-vrp.c.
> > The existing ones are already factored out to some extent, so I
> > expect you will continue that and split out the wide-int cases from
> > those existing routines.  Yes - that will not be very C++-ish - but
> > who cares.  Simply implement C-style workers; for example, be
> > inspired by the tree-ssa-ccp.c:bit_value_* routines implementing
> > non-zero-bits operation primitives on top of wide_ints (and also
> > wrapping those with tree variants).
>
> Sharing of code to do the same work was covered as something we will
> get to.  Aldy is working on that, so there is no dual maintenance.  I
> had started it with the wide-int-aux.[ch] file in the branch but never
> got very far.  Yeah, it's not super pretty, but it is what it is.
>
> It won't be perfect, however, as many of those routines are written
> with the assumption that ranges are either a single range or an
> anti-range.  We no longer have that restriction, and we will diverge
> in those cases.

But having 1 single range or N ranges isn't different when it comes to
the core working routines, because if you have N ranges it's simply a
matter of applying the op N times and unioning the results.  Yes - the
special case is now that OP (range) can result in multiple ranges, but
currently all the interesting cases you likely implement are covered by
returning either a range or an anti-range.  Because more generally
you'd not want N ranges but some affine thing so you can handle even
more cases without blowing up storage.
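Concretely, the core worker would just loop (a sketch with made-up
names, not anything that exists today):

  irange
  fold_range_op (enum tree_code code, tree type, const irange &r)
  {
    irange result;                       // starts empty
    // Apply the operation to each subrange and union the results --
    // the same worker as the single range / anti-range case, run N
    // times.
    for (unsigned i = 0; i < r.num_pairs (); ++i)
      result.union_ (fold_subrange (code, type,
                                    r.lower_bound (i),
                                    r.upper_bound (i)));
    return result;
  }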
> Extricating the code from the symbolic handling is also not
> straightforward sometimes, as there are big switches which mix and
> match various bits of each case.

Yes.  Which is why I never really got far with dumping the symbolic
stuff, because it's hard to come up with a replacement.  I guess in
reality you want to model symbolic ranges as constraints on affine
expressions and then have some constraint solver to answer queries --
at least for symbolic ranges the interesting transforms involve
comparison simplifications (aka jump threading) only.

> We will however try to distill it down to the wide-int aspects that
> are common.
>
> I would also argue that long term I see value_range disappearing
> completely... and then there won't be any dual maintenance :-)
>
> > For most of your on-demand uses I would expect a _much_ simpler and
> > less heavy-weight approach to be extending the recently merged
> > determine_value_range (I've pointed you to this multiple times in
> > the past) to walk SSA def-use chains using the existing SSA
> > on-the-side info as a cache (yes, w/o use-granular contextual
> > information).
>
> I think what we have is pretty lightweight and more powerful than that
> would be.  I started from scratch so we could get all the ideas in one
> place working together and not suffer from limits imposed by existing
> assumptions or information not easily available.  I think our results
> confirm those benefits.  It's fast, easy to use, and it's already
> starting to get things EVRP does not get.  It is also showing signs of
> finding things VRP itself has trouble with.  When we handle
> equivalencies, I would expect we will exceed current capabilities.
>
> > I also think that purely restricting yourself to wide-ints is a
> > mistake - we do very well want to have range information for
> > REAL_CSTs (or composite integers / reals).  How do you think of
> > extending Ranger to handle different kinds of types?  Eric already
> > raised the issue of symbolic ranges.
>
> This is in fact one of the main reasons we wrote everything new,
> including a distinct range class.  We expected to want to make this
> work on other types.
>
> irange and range-ops are the classes that handle integer ranges.  We
> have a very well defined API for ranges now, and the range engine
> itself simply uses this API.
>
> If we write a class 'real_range' to manipulate & store real ranges and
> implement 'real_range_ops' to teach the compiler how real ranges are
> extracted from expressions, the entire ranger engine will work on
> reals.
>
> There are 2 options for proceeding with this:
>
> a) We'd either template the ranger to use <class Range> instead of
> class irange, or at least extract out the common code and template
> that.  Then we'd have the same on-demand ranger for "real" ranges, or
> "complex" or whatever.  So instead of invoking it with 'path_ranger r',
> it'd be something like 'path_ranger<irange> r' or
> 'path_ranger<real_range> r'.
>
> I like that less because I hate templates, but it is an option.  The
> more likely case I planned for is:
>
> b) Adjust class irange to derive from a base range class with the API
> as virtual functions, and inherit irange and real_range from that for
> the ranger to call.  Then it would handle both simultaneously.  We'd
> have some tweaking to do where the ranger currently checks that the
> type of things is integral, but nothing too serious.  The entire
> structure of the range engine would still apply, and we could process
> both simultaneously.
>
> The second model was also intended from the beginning to allow us to
> adjust the number of sub-ranges dynamically.  Right now we set it to 3
> sub-ranges at compile time, but the intention is to allow this to vary
> if we want.  There are cases where some optimization might be willing
> to spend extra time/memory in order to get really accurate range info.
> Maybe switch analysis could use an irange which allowed up to 1000
> subranges so it could very precisely map what the value is at any
> given point.  The general case doesn't need that extra memory usage,
> however.
>
> It just gives us flexibility we don't currently have.  We also expect
> to do some tests and find out what the "optimum" default number of
> sub-ranges is.  I chose 3 because it was handling cases I wanted that
> we couldn't get before.  Maybe a different number is better; we'd have
> to run some experiments to see how often we could use more ranges.
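> A sketch of what option b) might look like (nothing like this exists
> yet; the names here are invented):
>
>   class range_base       // the virtual API the range engine calls
>   {
>   public:
>     virtual void union_ (const range_base &r) = 0;
>     virtual void intersect (const range_base &r) = 0;
>     virtual bool varying_p () const = 0;
>     virtual bool undefined_p () const = 0;
>   };
>
>   class irange : public range_base     { /* wide_int bounds */ };
>   class real_range : public range_base { /* REAL_VALUE_TYPE bounds */ };
>
> The engine would only ever manipulate ranges through the base API, so
> it would never need to know which kind of range it is processing.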
> > Yes, I do like having wide-int workers for ranges.
> >
> > Yes, I do like the idea of making ranges not restricted to single
> > sub-ranges (didn't review the implementation).
> >
> > But those things should have been done on the existing code, not
> > sticking sth completely new into GCC that cannot subsume the
> > existing stuff.  That's the road to maintenance hell which we
> > unfortunately walk way too often with GCC :/
>
> I argue that this will subsume all the existing code related to it.
> Sure, we can't do that right this moment; it's not a small task.  I
> think our pathway out of maintenance hell involves removing the
> existing value_range code and the various incarnations of VRP we have,
> and using something modern designed from scratch to solve these
> problems.  I would also predict we could do that in the release after
> this next one.  So we'd have one release with the ranger and the VRP
> brothers both present, and after that it would be just the Ranger.
>
> > Btw, the original goal of EVRP was to get rid of the
> > ASSERT_EXPR-style VRP pass once the threading bits it has are
> > subsumed by the backwards threader.  Does ranger at least allow
> > that - getting rid of the threading inside VRP without regressions?
>
> The ranger calculates ranges.  It does not implement a complete VRP
> replacement; we will implement a full VRP in time.  One of the
> original goals was to get this working so we can remove all the
> micro-optimizations that VRP does because range info isn't available
> anywhere else.  The clean goal is that all the threading-related range
> work should be done in threading via calls to the ranger.
>
> So to more directly answer the question: I do not know the state
> today.  I'm not the threader dude :-)  Aldy enhanced the backward
> threader to do some things we couldn't do before, and converted those
> other 3 passes.  I think Jeff has on his plate to remove the threading
> parts from VRP and put them back in the threader, so he can clarify
> this.  By the end of the stage I would expect that threader work to be
> complete and out of VRP.
>
> > Just as a final note - open source development doesn't work like
> > this, citing you: "I'd like to introduce a project we've been
> > working on for the past year and a half." - this simply provokes
> > "reviews" like the above.
>
> Well, it is what it is.  I see nothing wrong with the "review"; quite
> frankly, it's pretty much as expected.  I think it would have been
> almost identical if I'd talked about it last year.
>
> And why can't it work like this?  We aren't suggesting we shove this
> into trunk right now.  We've been experimenting with the technology,
> and now we think it's developed to the point where it can be of
> benefit, and we can demonstrate it works.  It's on a branch, and if
> anyone else wants to help, awesome.  We have a todo list.  It'll
> probably be months before we're even ready to be considered for a
> merge.  It has now advanced to the point where we have tangible things
> to talk about for the first time.
>
> Sure, we could replace irange with value_range and range-ops with some
> serious reconfiguration of the range extraction code, and the ranger
> would still work; it just needs the same range API.  I happen to think
> irange and range-ops offer benefits that value_range can't, so doing
> that would be a step backwards.
>
> I chose to create irange so we had the flexibility to enhance its
> abilities in the future while offering some improvements now.
> Range-ops was a desire to unite the range processing code in one
> place, and I needed to add the ability to do algebraic rearrangements
> that we currently don't do in VRP... the op1_irange and op2_irange
> code that derives ranges for operands given a LHS and the other
> operand.  Yeah, it's not in one place yet, but it will be.
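> As an example of running an operation backwards (made-up ranges):
>
>   a_2 = b_1 + 5;
>   if (a_2 > 10)     // on the true edge a_2 is [11, MAX]
>     ...             // op1_irange solves b_1 = a_2 - 5,
>                     // giving b_1 = [6, MAX - 5]
>
> i.e. knowing the range of the LHS and one operand, the PLUS_EXPR entry
> can derive a range for the other operand.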
> So at what point do we decide community involvement is appropriate?
> As soon as I get the idea?  Maybe after I had the third prototype
> sort-of working, indicating the approach was feasible?  It was pretty
> ugly then; I wouldn't want anyone to have seen that :-)  All the exact
> same questions would have come up then as now: we need to handle
> symbolics, it's too expensive, reuse the existing range code.  Rather
> than waving my hands to answer and not really accomplishing anything,
> I can now point to tangible things.  (mostly :-)  I would have still
> pushed for irange and range-ops because I think they are the right
> strategic solution.  Maybe someone else could have helped advance
> things a little faster, but I was a pretty big bottleneck on design
> and reworking things.  Ask Aldy.
>
> The only thing that we ever had working that was even worth
> considering submitting was the irange work last July, and that did not
> get very far.  We could also have waited another year until we have a
> complete VRP replacement working.  Then it's a drop-in solution that
> removes a lot of the arguments about code sharing and not re-using
> value_range.  That might have been even easier!  I do think it's
> reached the point where there is a lot of benefit to be derived now,
> and so here it is; let's figure out what we can do with it.
>
> Anyway, the "lateness" of the introduction is purely on me.  I like to
> get things working before talking about them, especially when they
> start as half-baked ideas.

I would have liked to see factoring out from the current code first,
and actually enhancing the current value_range representation by
removing the restriction on single ranges.  I don't very much believe
in the "start from scratch" approach, because history shows we always
end up with both, the old and the new, afterwards.

Richard.

> I think this is a good strategic direction to go: it demonstrates
> benefit now, and has a path to a better/easier-to-maintain VRP with a
> compiler-wide range resource.
>
> Andrew