compiling very large functions.
I think that it is time that we in the GCC community took some time to address the problem of compiling very large functions in a somewhat systematic manner.

GCC has two competing interests here: it needs to be able to provide state-of-the-art optimization for modest-sized functions, and it needs to be able to properly process very large machine-generated functions using reasonable resources.

I believe that the default behavior for the compiler should be that certain non-essential passes are skipped if a very large function is encountered.

There are two problems here:

1) defining the set of optimizations that need to be skipped.
2) defining the set of functions that trigger the special processing.

For (1) I would propose that three measures be made of each function. These measures should be made before inlining occurs. The three measures are the number of variables, the number of statements, and the number of basic blocks. Many of the GCC passes are nonlinear in one or more of these measures, and those passes should be skipped if one or more of these measures exceeds some threshold.

For (2) I would propose that we add three new fields to the pass descriptions seen by the compilation manager. A field would be null or zero if the pass is either essential or only linear in the corresponding measure; otherwise, it would hold some indication of either a threshold or the exponent of the growth. The compilation manager could then look at the options, in particular the -O level and perhaps some new options to indicate that this is a small machine or, at the other extreme, "optimize all functions come hell or high water!!", and skip those passes that would cause performance problems.

I do not claim to understand how sensitive every pass is to these measures. However, I could possibly make a good first cut on the RTL passes.

I think that before anyone starts hacking anything, we should come to a consensus on an overall strategy and implement something consistent for the entire compiler, rather than attack some particular pass in a manner that only gets us past the next PR.

Volunteer(s) to implement the compilation manager part of this would also be appreciated.

Kenny
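To make the shape of the proposal concrete, here is a minimal sketch in C of how a compilation manager might gate a pass on the three measures. All names and the threshold encoding are invented for illustration; nothing here corresponds to actual GCC data structures.

  /* A sketch of the gating idea: each pass records a threshold (or 0
     for "always run") per measure, and the compilation manager
     compares those against numbers gathered once per function.  */

  #include <stdbool.h>

  struct function_measures
  {
    unsigned num_vars;    /* number of variables    */
    unsigned num_stmts;   /* number of statements   */
    unsigned num_blocks;  /* number of basic blocks */
  };

  struct pass_limits
  {
    /* 0 means the pass is essential or linear in that measure.  */
    unsigned max_vars;
    unsigned max_stmts;
    unsigned max_blocks;
  };

  /* Return true if the compilation manager should run the pass.  A
     "damn the torpedoes" option would simply bypass the checks.  */
  static bool
  should_run_pass (const struct pass_limits *limits,
                   const struct function_measures *fn,
                   bool optimize_everything)
  {
    if (optimize_everything)
      return true;
    if (limits->max_vars && fn->num_vars > limits->max_vars)
      return false;
    if (limits->max_stmts && fn->num_stmts > limits->max_stmts)
      return false;
    if (limits->max_blocks && fn->num_blocks > limits->max_blocks)
      return false;
    return true;
  }

The point of putting the decision in one place is exactly the one argued later in this thread: a single knob and a single set of sweet spots, rather than each pass rolling its own cutoff.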
Re: compiling very large functions.
Richard Guenther wrote:
> On 11/4/06, Kenneth Zadeck <[EMAIL PROTECTED]> wrote:
>> I think that it is time that we in the GCC community took some time to
>> address the problem of compiling very large functions in a somewhat
>> systematic manner.
>>
>> GCC has two competing interests here: it needs to be able to provide
>> state of the art optimization for modest sized functions and it needs to
>> be able to properly process very large machine generated functions using
>> reasonable resources.
>>
>> I believe that the default behavior for the compiler should be that
>> certain non essential passes be skipped if a very large function is
>> encountered.
>>
>> There are two problems here:
>>
>> 1) defining the set of optimizations that need to be skipped.
>> 2) defining the set of functions that trigger the special processing.
>>
>> For (1) I would propose that three measures be made of each function.
>> These measures should be made before inlining occurs. The three measures
>> are the number of variables, the number of statements, and the number of
>> basic blocks.
>
> Why before inlining? These three numbers can change quite significantly
> as a function passes through the pass pipeline. So we should try to keep
> them up-to-date to have an accurate measurement.

I am flexible here. We may want inlining to be able to update the numbers. However, I think that we should drive the inlining aggression based on these numbers.

> Otherwise the proposal sounds reasonable but we should make sure the
> limits we impose allow reproducible compilations for N x M cross
> configurations and native compilation on different sized machines.

I do not want to get into the game where we are looking at the size of the machine and making this decision. Doing that would make it hard to reproduce bugs that come in from the field. Thus, I think that the limits (or functions) should be platform independent.

> Richard.
Re: compiling very large functions.
Richard Guenther wrote:
> On 11/4/06, Kenneth Zadeck <[EMAIL PROTECTED]> wrote:
>> Richard Guenther wrote:
>> > On 11/4/06, Kenneth Zadeck <[EMAIL PROTECTED]> wrote:
>> >> I think that it is time that we in the GCC community took some time to
>> >> address the problem of compiling very large functions in a somewhat
>> >> systematic manner.
>> >>
>> >> GCC has two competing interests here: it needs to be able to provide
>> >> state of the art optimization for modest sized functions and it needs to
>> >> be able to properly process very large machine generated functions using
>> >> reasonable resources.
>> >>
>> >> I believe that the default behavior for the compiler should be that
>> >> certain non essential passes be skipped if a very large function is
>> >> encountered.
>> >>
>> >> There are two problems here:
>> >>
>> >> 1) defining the set of optimizations that need to be skipped.
>> >> 2) defining the set of functions that trigger the special processing.
>> >>
>> >> For (1) I would propose that three measures be made of each function.
>> >> These measures should be made before inlining occurs. The three measures
>> >> are the number of variables, the number of statements, and the number of
>> >> basic blocks.
>> >
>> > Why before inlining? These three numbers can change quite significantly
>> > as a function passes through the pass pipeline. So we should try to keep
>> > them up-to-date to have an accurate measurement.
>>
>> I am flexible here. We may want inlining to be able to update the
>> numbers. However, I think that we should drive the inlining aggression
>> based on these numbers.
>
> Well, for example jump threading and tail duplication can cause these
> numbers to change significantly. Also CFG instrumentation and PRE
> can increase the BB count. So we need to deal with cases where an
> optimization produces an overly large number of basic blocks or
> instructions (like by throttling those passes properly).

I lean toward leaving the numbers static even if they do increase as time goes by. Otherwise you get two effects: the first optimizations get to run more, and you get weird nonlinear step functions where small changes in some upstream function affect the downstream ones.

kenny

> Richard.
Re: compiling very large functions.
Paolo Bonzini wrote:
>
>> While I agree with you, I think that there are so many things we are
>> already trying to address that this one can wait. I think we've been
>> doing a very good job on large functions too, and I believe that
>> authors of very large functions are just getting not only what they
>> deserve, but actually what they expect: large compile times
>> (superlinear).
>
> Not to mention that these huge functions are usually central to the
> program. If GCC decided that it is not worth optimizing the
> machine-generated bytecode interpreter of GNU Smalltalk, for example,
> I might as well rewrite it in assembly (or as a JIT compiler). Same
> for interpret.cc in libjava, though it is a tad smaller than GNU
> Smalltalk's interpreter.
>
> Unlike the authors of other VMs, I have no problem writing code so
> that the *latest* version of GCC will do its best, instead of
> complaining that GCC compiles my code worse on every release. So, I
> am ok with GCC doing stupid things because of bugs that I/we can fix,
> but not with GCC just giving up optimization on code that has always
> been compiled perfectly (in one/two minutes for about 30,000 lines of
> machine-generated code, despite being chock-full of computed gotos),
> that *can* be optimized very well, and that is central to the
> performance of a program.
>
> Paolo

I actually think that your Smalltalk example is the exception and not the rule. I would guess that the vast majority of very large functions are machine-generated simulations where the optimizer most likely provides little benefit.

In the case of dataflow, reaching defs is much more expensive than simple liveness, not because there are more bit vectors (there are exactly the same number) but because there are an order of magnitude more bits in those bit vectors than in live variables, and it just takes more time and space to move them around.

The thing is that even as memories get larger, something has to give. There are and will always be programs that are too large for the most aggressive techniques, and my proposal is simply a way to gracefully shed the most expensive techniques as the programs get very large. The alternative is to just shelve these bugs and tell the submitter not to use optimization on them. I do not claim to know what the right approach is.

kenny
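To put rough, invented but representative numbers on the reaching-defs point: live variables needs one bit per pseudo register per basic block, so a machine-generated function with 5,000 pseudos and 10,000 blocks carries bit vectors of 5,000 bits each. Reaching definitions needs one bit per definition site, and if each pseudo averages ten defs, that is 50,000 bits per block -- the same number of vectors, but an order of magnitude more bits to allocate, intersect, and propagate around the same CFG.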
Re: compiling very large functions.
Eric Botcazou wrote:
>> AFAIK not one of the tree optimizers disables itself, but perhaps we
>> should. The obvious candidates would be the ones that require
>> recomputation of alias analysis, and the ones that don't update SSA
>> info on the fly (i.e. require update_ssa, which is a horrible compile
>> time hog).
>
> Tree alias analysis can partially disable itself though:
>
>   /* If the program has too many call-clobbered variables and/or function
>      calls, create .GLOBAL_VAR and use it to model call-clobbering
>      semantics at call sites.  This reduces the number of virtual operands
>      considerably, improving compile times at the expense of lost
>      aliasing precision.  */
>   maybe_create_global_var (ai);
>
> We have found this to be quite helpful on gigantic elaboration procedures
> generated for Ada packages instantiating gazillions of generics. We have
> actually lowered the threshold locally.

I would like to point out that the central point of my proposal was to have the compilation manager be the process that manages whether an optimization is skipped, rather than having each pass make that decision on its own. If we have a central mechanism, then it is relatively easy to find some sweet spots. If every pass rolls its own, it is more difficult to balance.

Kenny
Re: compiling very large functions.
Andrew MacLeod wrote:
> On Sat, 2006-11-04 at 15:17 -0500, Kenneth Zadeck wrote:
>> ld.
>
>> However, I think that before anyone starts hacking anything, we should
>> come to a consensus on an overall strategy and implement something
>> consistent for the entire compiler rather than attack some particular
>> pass in a manner that only gets us past the next pr.
>
> I think we really do need to deal with it on a per pass basis. We
> occasionally get a monster PR which shows half a dozen (or more) serious
> time hogs. We more often get a pathological case for a specific pass.
>
> In each and every one of these cases, someone has taken a look at how to
> improve the time hog(s). Usually the result is a leaner/faster and
> improved pass, although it sometimes takes a new release for it to
> happen :-). Occasionally, we turn something off. A couple of PRs with
> massive functions are primarily responsible for the pending changes to
> out-of-ssa... And those changes will benefit small functions as well as
> the large ones.

The problem with trying to solve this on a per-pass basis rather than coming up with an integrated solution is that we completely leave the user out of the thought process. There are some users who have big machines or a lot of time on their hands and want "damn the torpedoes, full speed ahead", and there are some users who want reasonable decisions made even at high optimization. We need to give them an easy-to-turn knob. I am not saying that my original proposal was the best of all possible worlds, but hacking things on a pass-by-pass or PR-by-PR basis is not really solving the problem.

kenny
new DATAFLOW PORTING wiki available.
I have posted a new wiki page intended to help port maintainers with issues that may arise with the dataflow branch. It can be found at http://gcc.gnu.org/wiki/DataflowPorting.

Thanks,

Kenny
Re: Some thoughts and questions about the data flow infrastructure
> On Sunday I accidentally had a chat about the df infrastructure on
> IRC. I've got some thoughts which I'd like to share.
>
> I have liked the df infrastructure code from day one for its clearness.
> Unfortunately users don't see it and probably don't care about it.
> From my point of view the df infrastructure has a design flaw. It
> extracts a lot of information about the RTL and keeps it on the side.
> That does not make the compiler fast. It would be ok if we got better
> code quality. Danny told me that they have 1.5% better code using df.
> It is a really big improvement (about a half year of work for the whole
> compiler team according to Proebsting's law). IMHO, it could justify
> the promised 5% compiler slowness.

Vlad,

I think that different people can have different perspectives. You have been working on improving the register allocation for several years, but very little has come of it because the reload infrastructure does not lend itself to being integrated with modern register allocators. You have spent several years of work without touching the underlying problem: that reload is generally going to defeat almost any effort to get good benefits out of a new register allocator. I do not want to denigrate your work in any way, but at the end of the day, any new register allocator will be compromised by the existing reload implementation.

I am interested in bringing the rest of the back end into the modern world. While some of the passes can and should be moved into the SSA middle end of the compiler, there are several optimizations that can only be done after the details of the target have been fully exposed. My experience with trying to do this was that the number one problem was that the existing dataflow is in many cases wrong or too conservative, and that it was not flexible enough to accommodate most modern optimization techniques. So rather than hack around the problem, I decided to attack the bad infrastructure problem first, and open the way for myself and the others who work on the back end to benefit from that infrastructure to get the rest of the passes into shape.

There are certainly performance issues here. There are limits on how much I, and the others who have worked on this, have been able to change before we do our merge. So far, only those passes that were directly hacked into flow, such as dce and auto-inc-dec detection, have been rewritten from the ground up to fully utilize the new framework. However, it had gotten to the point where the two frameworks really should not coexist. Both implementations expect to work in an environment where the information is maintained from pass to pass, and doing that with two systems was not workable. So the plan accepted by the steering committee accommodates the wholesale replacement of the dataflow analysis, but even after the merge there will still be many passes that will be changed.

I would have liked to have the df information more tightly integrated into the rtl rather than it being on the side. It is cumbersome to keep this information up to date. However, the number of places in the backends that depend on the existing rtl data structure APIs makes such a replacement very difficult.

I do believe that by the time we merge the branch, we will be down to a 5% compile-time regression. While I would like this number to be 0% or negative, I personally believe that having precise and correct information is worth it, and that over time we will be able to remove that 5% penalty. As far as the other regressions, these will be dealt with very soon.

Kenny
Re: Some thoughts and questions about the data flow infrastructure
Vladimir N. Makarov wrote:
>> Vlad,
>>
>> I think that different people can have different perspectives. You have
>> been working on improving the register allocation for several years,
>> but very little has come of it because the reload infrastructure does
>> not lend itself to being integrated with modern register allocators.
>> You have spent several years of work without touching the underlying
>> problem: that reload is generally going to defeat almost any effort to
>> get good benefits out of a new register allocator. I do not want to
>> denigrate your work in any way, but at the end of the day, any new
>> register allocator will be compromised by the existing reload
>> implementation.
>
> Ken, to be exact I've been working for a bit more than 2 years on the
> register allocator itself. Probably you don't know that I attacked
> exactly the underlying problem - the reload. If you look at the YARA
> branch, it is a register allocator without the reload. And I am really
> worried that there is little outcome. But getting rid of the reload is
> such a complex problem. I cannot work on it for a few more years. I
> need some result too.
>
> Using the experience I've got from the YARA branch I've created another
> register allocator (the IRA branch) to make it ready for gcc-4.4. IRA
> still uses reload. But maybe I have higher standards. I don't want to
> include the code for the sake of inclusion. I have no worse code
> generated by IRA than the current register allocator. The code size is
> smaller for most platforms. For some platforms I have better generated
> code (up to 4% on SPECINT2000 in 32-bit mode: to be exact, 1930 vs 1850
> for Core2 according to this weekend's benchmarking).
>
> Actually I could make IRA ready for gcc-4.3. It works for x86, x86_64,
> itanium, ppc, sparc, s390, and arm. It is optional, so other platforms
> can use the current register allocator. But I don't want to rush.
>
> But still you are right that the reload is compromising the generated
> code. If you are interested in my opinion, the df infrastructure is a
> tiny part of the RA implementation problem (and, as I understand it, of
> the insn scheduler and code selection too). Actually IRA uses the df
> infrastructure, but it could easily be switched to the old life
> analysis.
>
>> I am interested in bringing the rest of the back end into the modern
>> world. While some of the passes can and should be moved into the SSA
>> middle end of the compiler, there are several optimizations that can
>> only be done after the details of the target have been fully exposed.
>
> Bringing the rest of the back end into the modern world is too
> challenging a task. If you really want it, imho you should attack RTL
> and the machine descriptions. But this task is a magnitude more
> difficult than introducing tree-SSA.

It is a hard project, and you are right that replacing rtl would be better. However, I do not know how to do that, either from a logistical point of view or from the point of view of having a better replacement that covers all of the platforms as well. There are a lot of sins in the back end, and a large number of them are either being directly addressed by this replacement or are now accessible. The addition of df will allow others to introduce better technology.

>> My experience with trying to do this was that the number one problem
>> was that the existing dataflow is in many cases wrong or too
>> conservative, and that it was not flexible enough to accommodate most
>> modern optimization techniques. So rather than hack around the
>> problem, I decided to attack the bad infrastructure problem first, and
>> open the way for myself and the others who work on the back end to
>> benefit from that infrastructure to get the rest of the passes into
>> shape.
>
> I am not against a new DF infrastructure, a more accurate one. I am
> against a slower infrastructure.
>
>> There are certainly performance issues here. There are limits on how
>> much I, and the others who have worked on this, have been able to
>> change before we do our merge. So far, only those passes that were
>> directly hacked into flow, such as dce and auto-inc-dec detection,
>> have been rewritten from the ground up to fully utilize the new
>> framework. However, it had gotten to the point where the two
>> frameworks really should not coexist. Both implementations expect to
>> work in an environment where the information is maintained from pass
>> to pass, and doing that with two systems was not workable. So the plan
>> accepted by the steering committee accommodates the wholesale
>> replacement of the dataflow analysis, but even after the merge there
>> will still be many passes that will be changed.
>
> Does it mean that the compiler will be even slower?
>
>> I would have liked to have the df information more tightly integrated
>> into the rtl rather than it being on the side. It is cumbersome to
>> keep this information up to date. However, the number of places in the
>> backends that depend on the existing rtl data structure APIs makes
>> such a replacement very difficult.
Re: Some thoughts and questions about the data flow infrastructure
Richard Kenner wrote:
>> Regs_ever_live is the poster child of this. In theory regs_ever_live is
>> easy, it is just the set of hard registers that are used. In practice
>> this is a disaster to keep track of because it was only updated
>> occasionally and its values are "randomly" changed by the backends in
>> totally undocumented ways. Maintaining regs_ever_live requires a lot of
>> special mechanism that slows down the incremental scanning.
>
> The history here, by the way, is that it was originally very simple and
> just supposed to provide a "quick and easy" way of having a conservative
> view of which registers *weren't* ever used. So it was set when a
> register might possibly be used. That was indeed easy.
>
> But then people wanted to be able to know *for sure* which registers were
> used, so mechanisms were added to clear it out when we knew a register
> *wasn't* used, which added the complexity you mention.
>
> This is a problem with a lot of the ad-hoc structures used: they were
> originally meant for one specific purpose and often used very locally and
> were reasonably well-designed for that purpose, but then were used more
> globally and/or for other purposes, and they weren't quite so well
> designed for those purposes anymore, but nobody went to the trouble to
> change them.
>
> I strongly support a new, common infrastructure that will allow all of
> these older structures to be replaced. But the history is important in my
> opinion because it means that we need to think as generally as possible
> and to ensure we come up with as broad a structure as possible, in order
> both to replace the current structures and to support many other uses in
> the future. From what I understand, the current mechanism does that, but
> I think it's important to keep this criterion in mind when evaluating any
> possible "competitors".

I very much understand the "I just need to make this one little tweak" road to hell. By definition, this was no one's plan. It will most likely take us all of stage 2 to understand and replace all of the one little tweaks with respect to regs_ever_live.
RE: new auto-inc-dec pass
> Hi,
>
> I'm currently looking at the dataflow branch for m68k, mainly because of
> the new auto-inc-dec pass. I worked a bit on the old code in flow.c, but
> considering the new pass, I think it doesn't make much sense to continue
> it.

I agree. With high probability the auto-inc detection will be replaced with the new auto-inc-dec.c on the dataflow branch. I would certainly encourage you to enhance the new code in auto-inc-dec.c where you find that it is lacking. That code does get all of the cases that were found by the old version of flow and does get many things that were missed. But there is no expectation that this is the last word in auto-inc-dec detection. My main concern was to cleanly separate this function from flow without any regressions.

> I'm attaching my current patch and some test cases, but before delving
> deeper into the code, I have a few questions and I'd like to hear some
> opinions. Currently I only looked at the output and it misses some
> opportunities (not that my patch can do everything either :) ).

I agree that the auto-inc-dec code will already get much of this. In particular, there is no longer (unless the machine requires it) a reliance that the increment match the size of the load or store.

> One case is about multiple increments; the tree optimizer merges them and
> increments the register only once, so if one only looks at the size of
> the pointer value one misses them, e.g. something like this:
>
>   (set (mem (reg)) (x))
>   (set (mem (plus (reg) (const_int 4))) (x))
>   (set (reg) (plus (reg) (const_int 8)))
>
> Another case is multiple uses of the register within an instruction; the
> old code didn't do this at all, the new code seems to do a bit better,
> but there was a rather tricky case I looked into, e.g.
>
>   (set (mem) (plus (mem) (reg)))
>
> could be turned into:
>
>   (set (post_inc (mem)) (plus (mem) (reg)))

I do find this a little weird, but if it makes a difference, go for it.

> or
>
>   (set (mem) (plus (pre_dec (mem)) (reg)))
>
> This is at least what operands_match_p() accepts, but hasn't been
> generated in a long time and has the potential to trigger bugs in the
> back end (e.g. m68k only looks at one of the operands).
> A question would be now if there is a general interest in reactivating
> this, so gcc could generate an instruction like "add.l %d0,(%a0)+" for
> m68k.
>
> The more general question is whether someone is already working on
> further improvements, so I won't duplicate anything (which wouldn't be
> before I looked into the remaining m68k dataflow problems anyway. :) ).

We have stopped development on this piece of code. It is efficient and works with no regressions on arm, ppc and ia-64. So it meets the minimum criteria for inclusion into mainline with the upcoming merge. However, that should in no way stop anyone else from enhancing it further.

> The dataflow porting document mentions other possible issues, but doesn't
> give any examples. Anything I might have to look out for regarding the
> m68k post_inc/pre_dec addressing modes?
>
> bye, Roman

Kenny
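To make the "multiple increments" case above concrete, here is a hedged C illustration (assuming a target where int is 4 bytes) of the source that produces that RTL shape. The tree optimizers merge the two pointer bumps into a single increment of 8, so a detector that insists the increment match the access size (4) would miss both post-increment opportunities:

  /* Hypothetical example; comments show the RTL sketched in the mail.  */
  void
  store_pair (int *p, int a, int b)
  {
    *p++ = a;  /* (set (mem (reg)) ...)                       */
    *p++ = b;  /* (set (mem (plus (reg) (const_int 4))) ...)  */
  }            /* (set (reg) (plus (reg) (const_int 8)))      */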
Re: [DataFlow Branch] Query regarding an ICE in global.c
Ramana Radhakrishnan wrote:
> Hi,
>
> I am working on integrating a private port into the Dataflow branch and
> am running into a couple of issues with ICEs in global.c. The ICE is at
> gcc_assert (REGS_LIVE (i)) at line 534 in global_alloc in global.c.
>
> However, because of the way we generate calls in the backend, with an
> extra clobber register for calculating the return address, the temporary
> which gets clobbered is here:
>
> (call_insn:HI 32 29 34 4 ../../../../gnu/libgcc/../gcc/unwind-dw2-fde.c:487
>     (parallel [
>             (set (reg:SI 1 $r1)
>                 (call (mem:SI (reg/v/f:SI 145 [ fde_compare ]) [0 S4 A32])
>                     (const_int 0 [0x0])))
>             (use (const_int 0 [0x0]))
>             (clobber (reg:SI 31 $link))
>             (clobber (reg:SI 153))
>         ]) 40 {call_value_indirect} (expr_list:REG_DEAD (reg:SI 3 $c3)
>         (expr_list:REG_DEAD (reg:SI 2 $r2)
>             (nil)))
>     (expr_list:REG_DEP_TRUE (use (reg:SI 3 $r3))
>         (expr_list:REG_DEP_TRUE (use (reg:SI 2 $r2))
>             (expr_list:REG_DEP_TRUE (use (reg:SI 1 $r1 [ ob ]))
>                 (nil)))))
>
> Is it something to do with the way that REGS_EVER_LIVE is calculated?
> If so, where does this get updated in the new infrastructure? I am still
> feeling my way through the df code, so any help would be appreciated.
> Thanks in advance.
>
> This is with revision 123253 of the DF branch, which is slightly older.
> I am considering moving up but wanted to check before that.
>
> cheers
> Ramana

There may have been a few fixes that might help, so you should try updating first. There is a fix that causes the REG_N... information to be rebuilt at the start of global rather than reusing the info computed during local. If you are adding this pseudo very late, this will likely fix the problem.

Fixing your kind of bug was not the reason for this fix. The fix solves an issue where the regs live across a call computed during local are different from the way they are computed during global. This kind of information is built by the RI problem when df_analyze is called, and that problem is not resolved when df_analyze is called inside global.

Hope this fixes your issue.

Kenny
Re: Canonical form of the RTL CFG for an IF-THEN-ELSE block?
Jan Hubicka wrote:
>> Hi,
>>
>> We would like to know if there is some way to find the true and false
>> branches of a conditional jump in RTL. In the tree CFG, we have two edge
>> flags for that, EDGE_{TRUE,FALSE}_VALUE, but those flags have no meaning
>> for the RTL CFG. So our question is, is there some other way to tell
>> what edge will be taken in a conditional jump if the condition is true?
>>
>> It seems that some passes assume a canonical form of IF-THEN-ELSE even
>> on RTL. From ifcvt.c:find_if_header:
>>
>>   /* The THEN edge is canonically the one that falls through.  */
>>   if (then_edge->flags & EDGE_FALLTHRU)
>>     ;
>>   else if (else_edge->flags & EDGE_FALLTHRU)
>>     {
>>       edge e = else_edge;
>>       else_edge = then_edge;
>>       then_edge = e;
>>     }
>>   else
>>     /* Otherwise this must be a multiway branch of some sort.  */
>>     return NULL;
>>
>> On the other hand, in cfgexpand.c:expand_gimple_cond_expr we have
>>
>>   false_edge->flags |= EDGE_FALLTHRU;
>>
>> and loop-unswitch.c assumes that the BRANCH_EDGE is the true_edge:
>>
>>   true_edge = BRANCH_EDGE (unswitch_on_alt);
>>   false_edge = FALLTHRU_EDGE (unswitch_on);
>>
>> So which is it? Is BRANCH_EDGE always taken if the condition is true, or
>> FALLTHRU_EDGE, or do you have to look at the condition to know? Who
>> knows an answer? :-)
>>
>> Gr. Steven
>
> :) It depends on how the conditional is constructed. If you use
> get_condition, the edge taken when the conditional is true is always
> BRANCH_EDGE if one exists (it is possible to have a conditional jump to
> the following instruction, where you have only one edge, with the
> EDGE_FALLTHRU flag). Otherwise you have to look into the conditional
> jump RTL yourself to figure out whether it has the form
>
>   (set (pc) (if_then_else (cond) (pc) (label_ref xxx)))
>
> or
>
>   (set (pc) (if_then_else (cond) (label_ref xxx) (pc)))
>
> In the first case we are taking the branch edge when the conditional is
> false.
>
> Honza

It seems there are two approaches to solving this problem: constructive and preservative. According to Danny, the constructive approach is what is used in the RTL-level if-conversion, and it is dead slow. In the preservative approach, you keep the markings that are inserted from the tree code and track their changes as the rtl gets modified. This assumes that we do not throw away the cfg right before rtl generation, which may be a big assumption, or at least that you add notes in the rtl that preserve this information for later reconstruction. I have to admit that I am always a fan of preserving information rather than reconstructing it. This need to just keep rescanning the intermediate representation over and over again is a big flaw in gcc and adversely affects its performance.

kenny
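Honza's "look into the RTL yourself" answer can be turned into a small helper. This is a sketch written against the RTL accessors of that era (pc_set, the IF_THEN_ELSE operand layout), illustrative rather than a vetted patch:

  /* Return true if JUMP, a conditional jump of the form
     (set (pc) (if_then_else COND ARM1 ARM2)), falls through when COND
     is true -- i.e. ARM1 is (pc) and the label is in ARM2, so the
     branch edge is taken when COND is false.  */
  static bool
  fallthru_when_true_p (rtx jump)
  {
    rtx set = pc_set (jump);           /* the (set (pc) ...)      */
    rtx cond = SET_SRC (set);          /* the if_then_else        */
    gcc_assert (GET_CODE (cond) == IF_THEN_ELSE);
    return XEXP (cond, 1) == pc_rtx;   /* is the "then" arm (pc)? */
  }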
Re: Proposed resolution to aliasing issue.
> Mark Mitchell wrote:
>>
>>   struct A {...};
>>   struct B { ...; struct A a; ...; };
>>
>>   void f() {
>>     B b;
>>     g(&b.a);
>>   }
>>
>> does the compiler have to assume that "g" may access the parts of "b"
>> outside of "a"? If the compiler can see the body of "g" then it may be
>> able to figure out that it can't access any other parts, or figure out
>> which parts it can access, and in that case it can of course use that
>> information. The interesting case, therefore, is when the body of "g"
>> is not available, or is insufficient to make a conclusive
>> determination.
>
> I attended a UK C++ panel meeting yesterday, and took the opportunity
> to solicit opinions on this. The question I posed was
>
>   struct A {
>     ...
>     T1 a;
>     T2 b;
>   };
>   void g(T1 &a);
>   void Foo () {
>     A v;
>     v.b = 2;
>     g (v.a);
>     if (v.b == 2) ...
>   }
>
> Does the compiler have a right to presume v.b does indeed == 2 in the
> if condition? -- assuming T2 is a suitable type for that :)
>
> After I explained the optimization (and the related structure splitting
> optimization), the general consensus was 'yes that would be a useful
> optimization'. But no one was sufficiently confident of proving it was
> allowable. The opinion was expressed that if it was not allowable,
> there was a bug in the std.
>
> The observation was made that if A is non-POD, one cannot play offsetof
> tricks to get from A::a to A::b, so the optimization is safe on
> non-PODs. (Of course one would have to prove the address of 'v' did not
> escape, so I guess the ctor and dtor would need to be trivial or
> visible.)
>
> One of the panel members is looking at C++'s memory model WRT
> multithreading. This question has a direct impact there too, as
> separate threads might be operating on v.a and v.b. He indicated he
> would consider the issue.
>
> I also outlined the approach gcc would take with a compile time switch
> and an attribute. The preference was expressed that the attribute
> should be of the form
>
>   void badfunc (A & __I_AM_BAD__ m);
>
> rather than
>
>   void goodfunc (A & __I_AM_GOOD__ m);
>
> because (a) badfuncs are more than likely rare and (b) it would be a
> useful aid to the programmer.[1] Mark outlines an __I_AM_GOOD__
> attribute; I think it would be better to have both flavours and then
> the compiler switch can specify which way the default goes.

I would like to point out some problems with this approach. Consider the case where you have three modules, A, B and C, each with a single function a, b, and c respectively, where a calls b and b calls c. Also assume that c has the __I_AM_BAD__ attribute.

What is known when compiling the A module? Function a does not know that b calls c. Are you going to require that b's prototype also have the __I_AM_BAD__ attribute because it calls c? Where does this stop? (See the sketch after this message.)

I believe that the only way to make this work is to have the attribute be associated with the data type. This attribute discriminates the type from a similar type that does not have the attribute, i.e. you cannot just assign a pointer to a bad int to a pointer to an int. This will force the programmer to have separate versions of functions that take the bad pointer and the good pointer, but it lets the compiler compile the good functions in a manner that rewards the good. It is also easy to track through the maze of separate compilation.

Kenny

Disclaimer: I have never written a single line of C++ in my life, nor have I ever read any C++ books or specifications. I am firmly rooted in the ML, Modula-3 and Java school of strongly typed, well defined languages.

> nathan
>
> [1] it was of course noted that that looked stunningly like 'restrict',
> and the suggestion that it be spelled 'noalias' was jokingly made :)
>
> --
> Nathan Sidwell :: http://www.codesourcery.com :: CodeSourcery LLC
> [EMAIL PROTECTED] :: http://www.planetfall.pwp.blueyonder.co.uk
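A sketch of the three-module chain described above. __I_AM_BAD__ is the hypothetical attribute under discussion, not a real GCC extension, so it is spelled here as a comment; the point is that when module A is compiled, nothing in b's prototype reveals that b forwards the pointer to the "bad" function c:

  struct A { int x; };

  /* module-c.c */
  void c (struct A *m /* __I_AM_BAD__ */);

  /* module-b.c */
  void b (struct A *m) { c (m); }  /* must b's prototype be "bad" too?  */

  /* module-a.c */
  void a (struct A *m) { b (m); }  /* a cannot tell when compiled alone */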
Re: Existing tree functionality?
I hope to have my code checked in today. Look in ipa-reference.c.

Kenny

Daniel Berlin wrote:
> Most of this can be found in the cgraph nodes. The rest requires
> scanning the IL. Ken Zadeck should have code to do this.
>
> On Wed, 2005-07-06 at 08:32 -0400, Michael Tegtmeyer wrote:
>> Hello,
>>
>> Is there existing functionality somewhere to sweep a function and
>> collect all externally visible variables at the tree level, or do I
>> need to roll my own? I've looked in tree.h and grepped around as much
>> as I could but I haven't found anything obvious.
>>
>> Thanks in advance,
>> Mike Tegtmeyer
why are there many copies of function_decls that have the same information including the same uid?
Is this a bug or a feature?
Re: why are there many copies of function_decls that have the same information including the same uid?
I was wrong, I misread some debugging output.

Sorry,

kenny

Andrew Pinski wrote:
> On Jul 11, 2005, at 8:50 AM, Daniel Berlin wrote:
>> On Mon, 2005-07-11 at 08:40 -0400, Kenneth Zadeck wrote:
>>> Is this a bug or a feature?
>>
>> Bug. Where is it occurring?
>
> I want to say the C++ front-end, since that is where a couple of ICEs
> have showed up because of that.
>
> -- Pinski
more on duplicate decls
I have been trying to clean up the location where the persistent ipa information is stored. The original way that I did this was that I had a field in the var_ann structure that pointed to the ipa information. Now that danny has reorganized the decls, the plan was to just add a field to the function_decl.

Danny has suggested that if I was going to add such a field, it should be a pointer to the cgraph_node, and the persistent ipa information should be stored there. I noticed that if I did this, a lot of stuff could fall away: there would be no need for the hash table that maps function_decls to cgraph nodes, nor any need for the master_clone field, since the cgraph node that the decl pointed to would be just as good. All of this sounds great... smaller, faster.

This has not gone as easy as I had hoped. Danny found the first stumbling block yesterday: the C decl merging needed to be enhanced. The second bug appears to be harder. In C++, there really can be multiple function_decls with the same uid. The kludge to handle them is documented in cp/name-lookup.c around line 670, and one test case that generates them is in gcc/testsuite/g++.dg/opt/inline2.C.

There are several ways to go about fixing this. One is to just abandon my fix and let others (such as the CodeSourcery people) deal with it when they go to build a real symbol table. It seems like a better way is to build a table of merges that need to be done, find some place after the c++ front end is finished but before the cgraph gets built, and do one scan of the code replacing all of the offending decls. Does such a location exist? Is this the correct plan?

Kenny
Re: more on duplicate decls
Mark Mitchell wrote:
> Kenneth Zadeck wrote:
>> The kludge to handle them is documented in cp/name-lookup.c around
>> line 670
>
> Ugh. IMO, the right thing here is that there should be only one
> FUNCTION_DECL for a given function, ever, period. Default arguments are
> not part of "the function" in C++; they are an aspect of particular
> declarations of the function. The problem we're having is that we
> associate them with the FUNCTION_DECL, rather than with the bindings
> (mappings from names to FUNCTION_DECLs). Unfortunately, fixing that
> properly is a rather large change.
>
>> It seems like a better way is to build a table of merges that need to
>> be done and find some place after the c++ front end is finished but
>> before the cgraph gets built and do one scan of the code and replace
>> all of the offending decls.
>
> After the C++ front end is done, it would be OK to canonicalize all
> FUNCTION_DECLs with a single UID into one. A stupid way of doing that
> would be to just walk the entire translation unit, and whenever you
> find a reference to a non-canonical copy of the FUNCTION_DECL, re-point
> it at the right one. The C++ front end always operates in
> unit-at-a-time mode, so, yes, you could do this when the C++ front end
> is finished.

Are you saying that if I wait until the front end is finished, I do not have to worry about the default args problem you mention above? Should I fix it this simple way, or should I let a C++ front end person fix it as the decls are created?

kenny
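Mark's "stupid way" would look roughly like the following sketch. The walk_tree callback signature follows the 4.x-era tree.h, but uid_to_canonical () is a hypothetical lookup (e.g. a UID-indexed hash table populated beforehand); this is illustrative only:

  /* Re-point non-canonical FUNCTION_DECLs at the canonical copy.  */
  static tree
  repoint_function_decls (tree *tp, int *walk_subtrees ATTRIBUTE_UNUSED,
                          void *data ATTRIBUTE_UNUSED)
  {
    if (TREE_CODE (*tp) == FUNCTION_DECL)
      {
        tree canon = uid_to_canonical (DECL_UID (*tp));  /* hypothetical */
        if (canon && canon != *tp)
          *tp = canon;
      }
    return NULL_TREE;  /* keep walking */
  }

  /* Called once per function body after the C++ front end finishes:
     walk_tree (&DECL_SAVED_TREE (fndecl), repoint_function_decls,
                NULL, NULL);  */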
Re: tree-ipa-type-escape slow
The splay tree in question is there as part of the type unification. This is required because our design for combining modules does not unify the types, and that messes up the type escape analysis. Because of this, the type name must be the key. However, there is the possibility that some special treatment for unnamed types is possible. I did not look into it. I do not understand the front ends enough to know how often these are created. Is this a c++ issue, or an issue associated with only this benchmark?

kenny

Richard Guenther wrote:
> I'm seeing splay tree usage from tree-ipa-type-escape in the first
> places of profiles from tramp3d-v4:
>
> samples  %        image name     app name   symbol name
> 141814   6.5462   no-vmlinux     no-vmlinux  (no symbols)
> 71471    3.2992   cc1plus        cc1plus     splay_tree_splay_helper
> 51114    2.3595   cc1plus        cc1plus     comptypes
> 49001    2.2619   cc1plus        cc1plus     compute_may_aliases
> ---
> 11160    15.6147  cc1plus        cc1plus     splay_tree_splay
> 59855    83.7473  cc1plus        cc1plus     splay_tree_splay_helper
> 71471    3.2992   cc1plus        cc1plus     splay_tree_splay_helper
> 71471    43.1489  cc1plus        cc1plus     splay_tree_splay_helper [self]
> 59855    36.1360  cc1plus        cc1plus     splay_tree_splay_helper
> 12966    7.8279   libc-2.3.5.so  cc1plus     strcmp
> 11404    6.8849   cc1plus        cc1plus     compare_type_brand
> 8650     5.       cc1plus        cc1plus     splay_tree_compare_pointers
> ---
>
> Looking into the ipa-type-escape code I wonder if there's a reason to
> use TYPE_NAME as a key instead of TYPE_UID. Also, if TYPE_NAME is
> necessary, using NULL instead of "" and special-casing that will be a
> lot faster, too. Can you elaborate on what exactly the datastructure
> is used for? Maybe for a large number of types it isn't really the
> most efficient one.
>
> Thanks,
> Richard.
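An illustration of why the table is keyed on the type name rather than TYPE_UID: when two modules are combined without unifying types, the same source-level type arrives as two distinct tree nodes with two distinct UIDs, so a UID or pointer key would treat them as unrelated. The UID values below are invented for the sake of the example:

  /* module1.c */ struct pt { int x, y; };  /* tree node T1, TYPE_UID 101 */
  /* module2.c */ struct pt { int x, y; };  /* tree node T2, TYPE_UID 202 */

  /* Keyed on TYPE_UID:     101 != 202, so escape info is split/wrong.
     Keyed on the name "pt": both modules map to one record.  */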
Re: [RFC] propagating loop dependences from trees to RTL (for SMS)
I will pull a patch together tomorrow. There is currently nothing in the code for keeping the region stuff up to date as changes are made to the cfg. For most changes this would not be hard, but for some it is really hard.

Daniel Berlin wrote:
> On Thu, 2005-09-22 at 18:49 -0700, Devang Patel wrote:
>> On Sep 22, 2005, at 2:32 AM, Steven Bosscher wrote:
>>> On Sep 22, 2005 11:25 AM, Zdenek Dvorak <[EMAIL PROTECTED]> wrote:
>>>>> 4. Other ideas?
>>>>
>>>> Preserving the information about loops throughout the optimizations,
>>>> and just keeping this information attached at the loop description,
>>>> would by far be the cleanest approach -- admittedly also the one that
>>>> requires the greatest amount of work.
>>>
>>> Shouldn't the regions patch allow us to preserve loops quite easily?
>>
>> Any pointer to this "regions patch"?
>>
>> Thanks,
>
> Ask kenny for a copy.
Re: Link-time optimization
On Wed, Nov 16, 2005 at 02:26:28PM -0800, Mark Mitchell wrote:
>> http://gcc.gnu.org/projects/lto/lto.pdf
>
> Section 4.2
>
> What is the rationale for using a stack-based representation rather
> than a register-based representation? An infinite-register-based
> solution would seem to map better onto gimple, making it easier to
> understand the conversion into and out of. The required stacking and
> unstacking operations would seem to get in the way.
>
> C.f. plan9's inferno, rather than jvm or cil.

What we wanted was something that was portable and easy to specify, but also easily exported from and imported into GCC. With respect to those constraints, JVM, CIL and LLVM do not make the cut because they are different enough from the tree-gimple code to make the translation into and/or out of GCC difficult. This is not NIH, just being pragmatic about not wanting to commit resources to things that are not important.

A stack machine representation was chosen for the same reason. Tree gimple is a series of statements, each statement being a tree. Smashing the trees and introducing temps is easy on the output side but requires a lot of work on the input side. I am not a fan of our tree representations, but I did not believe that changing them should be necessary to get to link-time optimization. If we decide we want to get rid of trees as an intermediate form, this decision should change.

A well-designed stack machine also provides for a very tight encoding. From the point of view of the size of the portable code, it is very desirable to minimize the number of temps that you create. You can do this in several ways:

1) Do some register allocation of the temps so that they are reused. This is non-trivial to undo (but truly doable), especially where you wish not to adversely impact debugging.

2) Just generate a lot of temps and hope that some other form of compression will save the day.

3) Use a stack to represent the intermediate nodes of the tree. This is what we chose to do.

It is trivial to generate the stack code from a single walk of the tree. It is trivial to regenerate the tree from a single pass over the stack code. The stack machine that we have in mind will be as stripped down as possible. The idea is just to get the trees in and get them back out.

Kenny
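A minimal sketch (invented opcodes, simplified expression tree) of why each direction is a single walk: a post-order walk emits stack code, and reading the code back while keeping an explicit stack of partially built nodes rebuilds the identical tree. None of this is the actual LTO encoding, just the shape of the technique:

  #include <stdio.h>

  enum { OP_CONST, OP_ADD };

  struct expr
  {
    int op;               /* OP_CONST or OP_ADD  */
    int value;            /* for OP_CONST        */
    struct expr *kid[2];  /* for binary ops      */
  };

  /* Post-order walk: operands are emitted before their operator, so
     the operator's "pop two, push one" discipline reassembles the
     tree on input with no temps at all.  */
  static void
  emit_stack_code (const struct expr *e)
  {
    if (e->op == OP_CONST)
      printf ("push %d\n", e->value);
    else
      {
        emit_stack_code (e->kid[0]);
        emit_stack_code (e->kid[1]);
        printf ("add\n");
      }
  }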
Re: Link-time optimization
> Mark Mitchell <[EMAIL PROTECTED]> writes:
>> http://gcc.gnu.org/projects/lto/lto.pdf
>
> Section 4.2 (Executable Representation) describes the GVM as a stack
> machine, and mentions load, store, duplicate, and swap operations. But
> it also discusses having registers which correspond to GIMPLE local
> variables. The way I put this together is that a MODIFY_EXPR which
> sets a local variable, say, will be converted to something that
> computes the expression using a stack and then assigns the value to a
> register. It's easy to see how to convert such a MODIFY_EXPR into a
> stack machine and back. But it's not easy to see why that stack
> machine is going to need, e.g., a swap operation. There is no GIMPLE
> swap operator. So I may still be confused about something.

You are most likely correct. We either do not need a swap or do not want to encourage people to try to optimize the trees so much that they need a swap.

> Ian
Re: Link-time optimization
> >> Thanks for working on this. Any specific reason why using the LLVM
> >> bytecode wasn't taken into account?
> >
> > It was.
> > A large number of alternatives were explored, including CIL, the JVM,
> > LLVM, etc.
> >
> >> It is proven to be stable, high-level enough to perform any kind of
> >> needed optimization,
> >
> > This is not true, unfortunately.
> > That's why it is called "low level virtual machine".
> > It doesn't have things we'd like to do high level optimizations on,
> > like dynamic_cast removal, etc.
>
> Anyway, *slightly* extending a VM which already exists, is
> production-ready, is GPL compatible, and is supported by a full
> toolchain (including interpreters, disassemblers, jitters, loaders,
> optimizers...) looks like a much better deal. Also, I'm sure Chris
> would be willing to provide us with all the needed help.
>
> I also think CIL would have egregiously worked. I'm sure the reasons to
> refuse it are more political than technical, so it's useless to go into
> further details I presume.

I do not think that CIL really would have worked. CIL has been carefully crafted to support what Microsoft wants it to support, and unrestricted C, C++, and Fortran are not in that mix. Remember that Microsoft wants to be able to take a CIL program produced on one (?Microsoft?) platform and run it on another (?Microsoft?) platform. By the time we get to where we want to produce the interchangeable code, the IL has a lot of platform knowledge (for instance, generated by the C preprocessor) that cannot be easily accounted for.

> Giovanni Bajo
Re: LLVM/GCC Integration Proposal
> Specifically, advocates of the recent link-time-optimization proposal
> [1] may claim that exposing all of the information that DWARF captures
> (e.g. about the type system) at link-time is a good thing: it makes it
> easier to do specific high-level optimizations, because all of the
> information is trivially available. However, they are ignoring the
> issue of handling multiple languages at the same time: an optimizer
> with this design has to be aware of (and be able to update) all
> possible source-language information and must be extended in the
> future as new languages are added (this is the problem with universal
> compilers, dating back to the UNCOL design). Also, the linker needs to
> know how to combine declarations defined in different languages,
> because an interprocedural optimizer only wants to see one declaration
> for each logical variable (for example). This quickly becomes
> difficult and messy, which is presumably why the link-time proposal
> allows the linker to "give up" linking two translation units.

The complexity of the type system handling in our proposal was motivated primarily by two concerns:

1) We need to keep track of the types so that they are available for debugging. As LLVM does not support debuggers at this point, it may be difficult to see through all of the issues required to keep track of the information while still doing good optimization. In the end we will have to produce a set of debugging info for the optimized program; losing all the type information before you start did not seem the correct plan.

2) GCC has a very large user community that knows nothing about compilers except how to shepherd their programs through GCC and run the result. Unless I am mistaken, LLVM has no such user community. I personally cannot guarantee that GCC (or for that matter any optimizing compiler) can correctly cross-inline and compile a program if the types in one module are not consistent with the types in another module. Just because the program happens to work correctly when separately compiled is not enough.

When Mark and I started working on this proposal (and later the rest of the volunteers), we decided that this was not going to be either an academic exercise or just something to run benchmarks. What that means to me is that the link-time optimizer needs to be able to either generate correct code or give up in some predictable manner. Having the compiler push forward and hope everything turns out OK is not enough. Discretion is the better part of valor.

I think that taking advantage of mixed C, C++ or C and Fortran programs is going to be hard. But it is what the GCC customers want, and there is a desire to accommodate them if possible.

On the positive side, LLVM has a lot more experience in managing whole-program compilation, and it is a much cleaner software base. It would be good to find some mechanism to take advantage of that experience. Trying to make the hippo dance is not really a lot of fun.

Kenny
question about registers that are set on entry to a function.
Ian and Honza,

Here is a first draft of a function to find the set of registers that are (or will be) defined on entry to a function. I generated this from our conversation on chat and by looking around. This function is necessary in the new version of df because, unlike flow, we do both forwards and backwards propagation to get better information. Thus there is no code in flow to give me a hint about this. Furthermore, the other places that do forwards analysis, like make_accurate_live_analysis, punt on the hard registers and just assume that they are all live on entry to the function.

1) Do you believe that this code could be correct?

2) Most of the paragraphs are protected by either reload_completed or epilogue_completed (which has a comment at its point of declaration that it is true for both the prologue and epilogue). Is it true that after these epochs there will really be code in the first basic block that will def these registers? I.e., are any of them implicitly set before the function is called, so that I will not find a def?

3) Am I missing any classes of registers?

Anyone else who has any opinions (or secret, undocumented knowledge of ports) is encouraged to answer back also.

Thanks,

Kenny

/* Record the (conservative) set of hard registers that are defined
   on entry to the function.  */

static void
df_record_entry_block_defs (struct dataflow *dflow)
{
  unsigned int i;
  bitmap_iterator bi;
  rtx r;
  struct df *df = dflow->df;

  bitmap_clear (df->entry_block_defs);

  if (! (df->flags & DF_HARD_REGS))
    return;

  /* Once the prologue has been generated, all of these registers
     should just show up in the first regular block.  */
  if (HAVE_prologue && epilogue_completed)
    {
      for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
        {
          if (FUNCTION_ARG_REGNO_P (i))
            bitmap_set_bit (df->entry_block_defs, i);
        }

#ifdef EH_USES
      for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
        if (EH_USES (i))
          bitmap_set_bit (df->entry_block_defs, i);
#endif

      if (REG_P (INCOMING_RETURN_ADDR_RTX))
        bitmap_set_bit (df->entry_block_defs,
                        REGNO (INCOMING_RETURN_ADDR_RTX));

      /* If STATIC_CHAIN_INCOMING_REGNUM == STATIC_CHAIN_REGNUM
         only STATIC_CHAIN_REGNUM is defined.  If they are different,
         we only care about the STATIC_CHAIN_INCOMING_REGNUM.  */
#ifdef STATIC_CHAIN_INCOMING_REGNUM
      bitmap_set_bit (df->entry_block_defs, STATIC_CHAIN_INCOMING_REGNUM);
#else
#ifdef STATIC_CHAIN_REGNUM
      bitmap_set_bit (df->entry_block_defs, STATIC_CHAIN_REGNUM);
#endif
#endif

      r = TARGET_STRUCT_VALUE_RTX (current_function_decl, true);
      if (r && REG_P (r))
        bitmap_set_bit (df->entry_block_defs, REGNO (r));
    }

  /* These registers are live everywhere.  */
  if (! reload_completed)
    {
      /* Any reference to any pseudo before reload is a potential
         reference of the frame pointer.  */
      bitmap_set_bit (df->entry_block_defs, FRAME_POINTER_REGNUM);

#if FRAME_POINTER_REGNUM != ARG_POINTER_REGNUM
      /* Pseudos with argument area equivalences may require
         reloading via the argument pointer.  */
      if (fixed_regs[ARG_POINTER_REGNUM])
        bitmap_set_bit (df->entry_block_defs, ARG_POINTER_REGNUM);
#endif

      /* Any constant, or pseudo with constant equivalences, may
         require reloading from memory using the pic register.  */
      if ((unsigned) PIC_OFFSET_TABLE_REGNUM != INVALID_REGNUM
          && fixed_regs[PIC_OFFSET_TABLE_REGNUM])
        bitmap_set_bit (df->entry_block_defs, PIC_OFFSET_TABLE_REGNUM);
    }

  EXECUTE_IF_SET_IN_BITMAP (df->entry_block_defs, 0, i, bi)
    {
      rtx def = df_reg_def_gen (true, i);
      df_ref_create_structure (dflow, regno_reg_rtx[i], &XEXP (def, 0),
                               ENTRY_BLOCK_PTR, NULL,
                               DF_REF_REG_DEF, DF_REF_ARTIFICIAL);
    }
}
Re: question about registers that are set on entry to a function.
Ian Lance Taylor wrote:
> Kenneth Zadeck <[EMAIL PROTECTED]> writes:
>> 1) Do you believe that this code could be correct?
>
> Well, no.

You do not have to sugarcoat it, I can handle the truth.

>> 2) Most of the paragraphs are protected by either reload_completed or
>> epilogue_completed (which has a comment at its point of declaration
>> that it is true for both the prologue and epilogue). Is it true that
>> after these epochs there will really be code in the first basic block
>> that will def these registers? I.e., are any of them implicitly set
>> before the function is called, so that I will not find a def?
>
> This isn't quite accurate. When the RTL is initially generated, for any
> values passed in parameters you will see at the start of the function
>
>   (set (pseudo-reg) (param-reg))
>
> That is, parameter handling is independent of the prologue/epilogue
> code, and thus independent of epilogue_completed. If you want to
> declare that all FUNCTION_ARG_REGNO_P registers are live, you should do
> that regardless of epilogue_completed.
>
> Actually FUNCTION_ARG_REGNO_P is not enough. It returns true for
> outgoing function registers. On a target with register windows, at the
> start of the function you need to apply INCOMING_REGNO to each register
> for which FUNCTION_ARG_REGNO_P returns true.
>
> The prologue code which is set up when epilogue_completed is true is
> code which sets up the stack frame and saves the callee saved registers
> on the stack. So when epilogue_completed is true, then at the start of
> the function you will probably want to treat as live every callee saved
> register for which regs_ever_live is true.
>
> For some targets the frame pointer will be live on function entry; for
> others it is initialized by the prologue if necessary.
>
> Assuming that PIC_OFFSET_TABLE_REGNUM is defined isn't always going to
> be correct, although it may not matter. On some targets, e.g., i386,
> the register will be defined by the prologue code anyhow.

This all seems doable. I can spin a better version of this.

>> 3) Am I missing any classes of registers?
>
> For complete accuracy, there are probably going to be some target
> specific registers which need to be handled, unfortunately. For
> example, on MIPS, with -mabicalls (which is the default on GNU/Linux),
> $25 is live on function entry. It holds the address of the function,
> and is used to initialize $28, which is PIC_OFFSET_TABLE_REGNUM. I
> don't think there is any target independent way of getting at that
> fact.

Is the right thing to do to define a target hook for this that defaults to doing nothing, so that we only define it in those places where necessary? Or should I add a few other macros to the ports as necessary and just check to see if they are defined? I have noticed that there are several macros that are only used on a single port.

> It is also worth noting that rather than assuming that all
> FUNCTION_ARG_REGNO_P registers are live, you could use
> FUNCTION_ARG/FUNCTION_ARG_ADVANCE to walk through the parameters and
> determine precisely which ones are live. I don't know whether that will
> make a significant difference for the rest of the code, though.

I will look into this also.

> Ian
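A sketch of the shape such a target hook might take. The hook name and signature here are invented for illustration (they are not an existing GCC interface in this thread); the default does nothing and a port like MIPS would set the extra registers it knows are live on entry:

  /* Hypothetical hook, e.g. in the targetm structure:
       void (*extra_entry_block_defs) (bitmap regs);  */

  /* Default implementation, used by most ports.  */
  static void
  default_extra_entry_block_defs (bitmap regs ATTRIBUTE_UNUSED)
  {
  }

  /* A MIPS-style override: with -mabicalls, $25 holds the function
     address and is live on entry (per Ian's example).  */
  static void
  mips_extra_entry_block_defs (bitmap regs)
  {
    if (TARGET_ABICALLS)
      bitmap_set_bit (regs, 25);
  }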
Re: question about registers that are set on entry to a function.
Ian Lance Taylor wrote:

Kenneth Zadeck <[EMAIL PROTECTED]> writes:

For complete accuracy, there are probably going to be some target specific registers which need to be handled, unfortunately. For example, on MIPS, with -mabicalls (which is the default on GNU/Linux), $25 is live on function entry. It holds the address of the function, and is used to initialize $28, which is PIC_OFFSET_TABLE_REGNUM. I don't think there is any target independent way of getting at that fact.

Is the right thing to do to define a target hook for this that defaults to doing nothing, and only define it in those places where necessary, or should I add a few other macros to the ports as necessary and just check to see if they are defined? I have noticed that there are several macros that are only used on a single port.

I think the best approach would be a target hook. I guess you could pass it the function decl and have it set bits in a bitmap or a HARD_REG_SET. You might even want to do the whole thing in a target hook, in which case the code you wrote would become the default version of the hook (and would probably be called by non-default versions). I don't have a strong feeling as to which approach would be best.

Ian

Being the glutton for punishment, let's try this again.

I have one question about EH_USES on the ia-64 (the only place it is used). The function that implements this starts out with

  int
  ia64_eh_uses (int regno)
  {
    if (! reload_completed)
      return 0;

This means that before register allocation, all of the uses of these registers are hidden. However, the definition of REG_ALLOC_ORDER contains all of the registers that ia64_eh_uses reports after reload. They are at the end of the order, and there are a large number of them, but it seems to me that if someone were to write a program that needed a large boatload of registers, the register allocator might be tempted to use these registers (unless there are no instruction patterns that use them), and then the exception unwinder would not work. Is this a latent bug?

Aside from the concern above and plumbing the target hook, how does this look?

  /* Record the (conservative) set of hard registers that are defined
     on entry to the function.  */

  static void
  df_record_entry_block_defs (struct dataflow *dflow)
  {
    unsigned int i;
    bitmap_iterator bi;
    rtx r;
    struct df *df = dflow->df;

    bitmap_clear (df->entry_block_defs);

    if (! (df->flags & DF_HARD_REGS))
      return;

    for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
      {
        if (FUNCTION_ARG_REGNO_P (i))
  #ifdef INCOMING_REGNO
          bitmap_set_bit (df->entry_block_defs, INCOMING_REGNO (i));
  #else
          bitmap_set_bit (df->entry_block_defs, i);
  #endif
      }

    /* Once the prologue has been generated, all of these registers
       should just show up in the first regular block.  */
    if (HAVE_prologue && epilogue_completed)
      {
        /* Defs for the callee saved registers are inserted so that the
           pushes have some defining location.  */
        for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
          if ((call_used_regs[i] == 0) && (regs_ever_live[i]))
            bitmap_set_bit (df->entry_block_defs, i);
      }
    else
      {
        if (REG_P (INCOMING_RETURN_ADDR_RTX))
          bitmap_set_bit (df->entry_block_defs,
                          REGNO (INCOMING_RETURN_ADDR_RTX));

        /* If STATIC_CHAIN_INCOMING_REGNUM == STATIC_CHAIN_REGNUM
           only STATIC_CHAIN_REGNUM is defined.  If they are different,
           we only care about the STATIC_CHAIN_INCOMING_REGNUM.  */
  #ifdef STATIC_CHAIN_INCOMING_REGNUM
        bitmap_set_bit (df->entry_block_defs, STATIC_CHAIN_INCOMING_REGNUM);
  #else
  #ifdef STATIC_CHAIN_REGNUM
        bitmap_set_bit (df->entry_block_defs, STATIC_CHAIN_REGNUM);
  #endif
  #endif

        r = TARGET_STRUCT_VALUE_RTX (current_function_decl, true);
        if (r && REG_P (r))
          bitmap_set_bit (df->entry_block_defs, REGNO (r));
      }

    /* These registers are live everywhere.  */
    if (!reload_completed)
      {
        /* Any reference to any pseudo before reload is a potential
           reference of the frame pointer.  */
        bitmap_set_bit (df->entry_block_defs, FRAME_POINTER_REGNUM);

  #ifdef EH_USES
        /* The ia-64, the only machine that uses this, does not define
           these until after reload.  */
        for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
          if (EH_USES (i))
            bitmap_set_bit (df->entry_block_defs, i);
  #endif

  #if FRAME_POINTER_REGNUM != ARG_POINTER_REGNUM
        /* Pseudos with argument area equivalences may require
           reloading via the argument pointer.  */
        if (fixed_regs[ARG_POINTER_REGNUM])
          bitmap_set_bit (df->entry_block_defs, ARG_POINTER_REGNUM);
  #endif

  #ifdef PIC_OFFSET_TABLE_REGNUM
        /* Any constant, or pseudo with constant equivalences, may
           require reloading from memory using the pic register.  */
        if ((unsigned) PIC_OFFSET_TABLE_REGNUM != INVALID_REGNUM
            && fixed_regs[PIC_OFFSET_TABLE_REGNUM])
          bitmap_set_bit (df->entry_block_defs, PIC_OFFSET_TABLE_REGNUM);
  #endif
      }
  }
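For the target-specific leftovers like the MIPS $25 case, a sketch of the hook approach under discussion might look like this. The hook names are made up for illustration; TARGET_ABICALLS and PIC_FUNCTION_ADDR_REGNUM are the real MIPS macros:

  /* Hypothetical default: most targets have nothing extra live on
     entry, so the default hook adds no bits.  */
  static void
  default_extra_live_on_entry (bitmap entry_defs ATTRIBUTE_UNUSED)
  {
  }

  /* What a MIPS override might look like: with -mabicalls, $25 holds
     the function's own address on entry and is used to set up $28.  */
  static void
  mips_extra_live_on_entry (bitmap entry_defs)
  {
    if (TARGET_ABICALLS)
      bitmap_set_bit (entry_defs, PIC_FUNCTION_ADDR_REGNUM);
  }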
Re: Mainline build failure
Steven Bosscher wrote:

Hi,

I can't build the trunk today:

gcc -c -O0 -g -DIN_GCC -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -pedantic -Wno-long-long -Wno-variadic-macros -Wold-style-definition -Wmissing-format-attribute -DHAVE_CONFIG_H -I. -I. -I../../trunk/gcc -I../../trunk/gcc/. -I../../trunk/gcc/../include -I../../trunk/gcc/../libcpp/include -I../../trunk/gcc/../libdecnumber -I../libdecnumber ../../trunk/gcc/df-problems.c -o df-problems.o
../../trunk/gcc/df-core.c: In function ‘df_compact_blocks’:
../../trunk/gcc/df-core.c:795: error: invalid lvalue in assignment
../../trunk/gcc/df-core.c:803: error: invalid lvalue in assignment
../../trunk/gcc/df-core.c: In function ‘df_bb_replace’:
../../trunk/gcc/df-core.c:833: error: invalid lvalue in assignment
make[2]: *** [df-core.o] Error 1
make[2]: *** Waiting for unfinished jobs

Path to problem:
../trunk/configure --enable-languages=c --disable-{libmudflap,nls,libssp,checking} --disable-bootstrap
make -j 2 CFLAGS="-O0 -g"

My host compiler is "gcc (GCC) 4.0.2 20050901".

Gr.
Steven

I posted the obvious patch to this a few hours ago, but because of the problems with the mail, no one has seen it. I just talked to David Edelsohn and he told me to check it in. It will be committed in a minute.
Re: Mainline build failure
Steven, actually the last piece of mail I sent you was a lie. The bug I fixed is different. The problem is that between the time that I checked my code in and now, someone changed the representation of BASIC_BLOCK.

lorien:~/gccDFTest/gcc(27) diff basic-block.h ../../gccBaseline/gcc
370c370
< varray_type x_basic_block_info;
---
> VEC(basic_block,gc) *x_basic_block_info;
402c402
< (VARRAY_BB (basic_block_info_for_function (FN), (N)))
---
> (VEC_index (basic_block, basic_block_info_for_function (FN), (N)))
414c414,415
< #define BASIC_BLOCK(N) (VARRAY_BB (basic_block_info, (N)))
---
> #define BASIC_BLOCK(N) (VEC_index (basic_block, basic_block_info, (N)))
> #define SET_BASIC_BLOCK(N,BB) (VEC_replace (basic_block, basic_block_info, (N), (BB)))

I will change my code and submit something when I get it thru a stage 1 build.

kenny

Steven Bosscher wrote:

On Wednesday 11 January 2006 21:44, Steven Bosscher wrote:

Hi,

I can't build the trunk today:

gcc -c -O0 -g -DIN_GCC -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -pedantic -Wno-long-long -Wno-variadic-macros -Wold-style-definition -Wmissing-format-attribute -DHAVE_CONFIG_H -I. -I. -I../../trunk/gcc -I../../trunk/gcc/. -I../../trunk/gcc/../include -I../../trunk/gcc/../libcpp/include -I../../trunk/gcc/../libdecnumber -I../libdecnumber ../../trunk/gcc/df-problems.c -o df-problems.o
../../trunk/gcc/df-core.c: In function ‘df_compact_blocks’:
../../trunk/gcc/df-core.c:795: error: invalid lvalue in assignment
../../trunk/gcc/df-core.c:803: error: invalid lvalue in assignment
../../trunk/gcc/df-core.c: In function ‘df_bb_replace’:
../../trunk/gcc/df-core.c:833: error: invalid lvalue in assignment
make[2]: *** [df-core.o] Error 1
make[2]: *** Waiting for unfinished jobs

From df-core.c:

  free (problem_temps);

  i = NUM_FIXED_BLOCKS;
  FOR_EACH_BB (bb)
    {
      BASIC_BLOCK (i) = bb;
      bb->index = i;
      i++;
    }
  gcc_assert (i == n_basic_blocks);

  for (; i < last_basic_block; i++)
    BASIC_BLOCK (i) = NULL;

Now look at cfg.c:compact_blocks... df-core.c should use that function instead of duplicating it.

Gr.
Steven
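For reference, the mechanical shape of the fix implied above, given the new VEC representation shown in the diff: BASIC_BLOCK is no longer an lvalue, so the stores go through SET_BASIC_BLOCK. This is a sketch, not the committed patch:

  i = NUM_FIXED_BLOCKS;
  FOR_EACH_BB (bb)
    {
      SET_BASIC_BLOCK (i, bb);	/* was: BASIC_BLOCK (i) = bb; */
      bb->index = i;
      i++;
    }
  gcc_assert (i == n_basic_blocks);

  for (; i < last_basic_block; i++)
    SET_BASIC_BLOCK (i, NULL);	/* was: BASIC_BLOCK (i) = NULL; */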
reviewers for wide int.
Richi,

David Edelsohn said that I should talk to you about appointing reviewers for wide-int. While I think that it may not be necessary to have any reviewers for wide-int in the long term, I think that it would be useful to make Richard Sandiford, Mike Stump and myself reviewers at least for this release cycle. While of course one hopes that there will be no issues with wide-int, a change of this size will have some pain no matter how well we have tested it. Having three reviewers will ensure that problems are resolved quickly.

Thanks,

Kenny
Re: reviewers for wide int.
On 04/22/2014 03:37 PM, Richard Biener wrote:

On April 22, 2014 9:28:15 PM CEST, Kenneth Zadeck wrote:

Richi,

David Edelsohn said that I should talk to you about appointing reviewers for wide-int. While I think that it may not be necessary to have any reviewers for wide-int in the long term, I think that it would be useful to make Richard Sandiford, Mike Stump and myself reviewers at least for this release cycle. While of course one hopes that there will be no issues with wide-int, a change of this size will have some pain no matter how well we have tested it. Having three reviewers will ensure that problems are resolved quickly.

Works for me. I suppose this mainly covers wide-int.[CH], right?

If you want to define it that narrowly, you can. It really depends on how much help you want and how much you trust us not to go beyond what is reasonable. All three of us have been at this long enough to know when to ask for help.

Kenny

Richard.

Thanks,

Kenny
Re: reviewers for wide int.
On 04/23/2014 04:04 AM, Richard Biener wrote:

On Tue, 22 Apr 2014, Mike Stump wrote:

On Apr 22, 2014, at 12:48 PM, Kenneth Zadeck wrote:

While of course one hopes that there will be no issues with wide-int, a change of this size will have some pain no matter how well we have tested it. Having three reviewers will ensure that problems are resolved quickly.

Works for me. I suppose this mainly covers wide-int.[CH], right?

If you want to define it that narrowly, you can. It really depends on how much help you want and how much you trust us not to go beyond what is reasonable. All three of us have been at this long enough to know when to ask for help.

There is a large class of bugs that can creep in due to the subtle change of interface from double-int to wide-int. These happen outside of the wide-int.[ch] code and seem statistically more likely by a large margin than bugs in wide-int.[ch]. The good news is that resolving them is easy enough with side-by-side comparisons (say of dump files and .s files). Most of those fixes I'd expect to be trivial (for some definition of trivial).

Yeah. Note that it's difficult to define "reviewer for code that uses wide-int", thus my question (that is, what do you put into MAINTAINERS and how would you interpret the entry). But as always we apply common sense to reviewer/maintainership areas.

richi.

This is not without precedent. The dataflow reviewers are authorized to review changes to dataflow anywhere in the rtl level and back ends. In the many years that that has been in place, none of us "went rogue". We will be conservative.

kenny

Richard.
status of wide-int patch.
At this point we believe that we have addressed all of the concerns that the community has raised about the wide-int branch. We have also had each of the sections of the branch approved by the area maintainers. We are awaiting a clean build on arm and are currently retesting x86-64, s390, and p7, but assuming that those are clean, we are ready to merge this branch into trunk in the next day or so. Other port maintainers may wish to consider testing on the branch before we commit. Otherwise we will fix any regressions after the merge.

Thanks for all of the help we have received along the way.

Kenny
merging the wide-int branch
We are now ready to merge the wide-int branch. The branch was broken into small pieces and each of the area maintainers has approved their pieces. The branch has been tested and runs regression-free on three 64 bit platforms: x86, ppc, and s390, and on three 32 bit platforms: x86, arm and s390. The range of these includes both big and little endian for both sizes. The testing has included all supported front ends.

Our plan, unless we hear any objections, is to merge tomorrow, on Tuesday May 6. While I am sure there will be some fallout, there are wide-int maintainers in all time zones, so we should be able to address any issues in a hurry.

Thanks,

Kenny
we are starting the wide int merge
please hold off on committing patches for the next couple of hours as we have a very large merge to do. thanks. kenny
Re: How to update reg_dead notes
It is generally as easy as just adding the problem and calling df_analyze. You tend to get into trouble if the rtl is not good, i.e. there is improper sharing or other violations of the canonical rtl rules. DF does not like improperly shared rtl, and it has not been uncommon for port specific passes to play loose with these rules. I would suggest that you first try a build where you turn on all of the rtl checking and see if that comes out clean.

Kenny

On 02/24/2015 06:41 AM, Georg-Johann Lay wrote:

Hi, in order to fix PR64331 I tried to implement new target-specific passes whose sole purpose is to recompute REG_DEAD notes. The avr BE relies on correct dead notes, which are used in avr.c:reg_unused_after, which uses dead_or_set_p. The avr BE needs correct dead notes in ADJUST_INSN_LENGTH, get_attr_length, get_attr_min_length, etc.

After trying for more than one day I am really frustrated; each approach ended up in segfaults somewhere in df. Following the source comments in df-core.c, recomputing dead notes should be as easy as

  df_note_add_problem ();
  df_analyze ();

in the execute() method of the new pass. As this (and many many other tries using df_scan_alloc, df_scan_blocks, df_finish_pass, df_insn_rescan_all, etc.) always crashes the compiler, I must be doing something completely wrong...

Could you give me some advice on correct usage of df, or even better, point me to a comprehensible documentation of df which is more complete than what is in df-core.c? The internals manual doesn't cover df, and the source comments are not really helpful; e.g. the complete documentation of df_analyze is /* Analyze dataflow info. */. Not a single word about prerequisites (except that it must run after df_init and before df_finish and needs a correct cfg).

One example of a crash is that df->insns[uid] is being accessed and dereferenced where uid is a valid uid but df->insns[uid] is NULL.

df-core.c mentions "given instances of df". How do I get one? The only instance I can find is the global struct df_d *df. Does this mean one has to mess around with that global variable?
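For concreteness, the kind of fully-checked, debuggable build being suggested can be had with something like the following. This is an illustrative invocation (paths are made up); the checking categories are the standard ones accepted by GCC's configure:

  ../trunk/configure --enable-languages=c \
      --enable-checking=yes,df,rtl,rtlflag,gcac
  make CFLAGS="-O0 -g"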
Re: How to update reg_dead notes
When I suggested that you do a build with all of the checking turned on, I wanted you to do this without your new code in it. There is a good possibility that the problem is that your port is generating bad rtl. Also, you should generate a debuggable compiler so that the line numbers have some relevance to reality.

On 02/24/2015 01:23 PM, Georg-Johann Lay wrote:

Am 02/24/2015 um 06:06 PM schrieb Eric Botcazou:

Could you give me some advice on correct usage of df, or even better, point me to a comprehensible documentation of df which is more complete than what is in df-core.c?

Take a look at the c6x and mep ports.

Thanks for the hint. I changed the execute method to:

  unsigned int execute ()
  {
    compute_bb_for_insn ();
    //df_clear_flags (DF_LR_RUN_DCE);
    df_note_add_problem ();
    df_analyze ();
    df_finish_pass (false);
    return 0;
  }

but it keeps aborting. Actually I am just copy-pasting code from some other passes / BEs, but these places also don't explain why they must use, may use, or must not use specific functions.

Compiling the mentioned test case from libgcc with -Os (so that fewer insns are left over) and patching df-scan.c:df_refs_verify to print the refs just before it does gcc_assert (0):

  new_ref = { u-1(18){ }}
  *old_rec = { u-1(18){ }u-1(19){ }u-1(24){ }u-1(25){ }}

libgcc2-addvsi3.c: In function '__addvhi3':
libgcc2-addvsi3.c:1514:1: internal compiler error: in df_refs_verify, at df-scan.c:4338

In df_insn_refs_verify, which is in the call chain of df_refs_verify, insn reads:

  (insn 21 19 22 5 (parallel [
              (set (cc0)
                  (compare (reg/v:HI 18 r18 [orig:48 a ] [48])
                      (reg/v:HI 24 r24 [orig:46 w ] [46])))
              (clobber (scratch:QI))
          ]) libgcc2-addvsi3.c:1511 408 {*cmphi}
       (expr_list:REG_DEAD (reg/v:HI 18 r18 [orig:48 a ] [48])
          (nil)))

r18 and r24 are ordinary general-purpose hard registers.

The latest pass which runs before the crash is .split5, i.e. recog.c::pass_split_for_shorten_branches, which executes split_all_insns_noflow, which in turn reads:

  /* Same as split_all_insns, but do not expect CFG to be available.
     Used by machine dependent reorg passes.  */

  unsigned int
  split_all_insns_noflow (void)
  {
    ...

Does this mean the CFG is (or might be) messed up after this pass, and that this is the reason why df crashes, because df needs a correct CFG?

Johann
Re: How to update reg_dead notes
It is a good policy to do a run with full checking every once in a while; it is easy to have some garbage sneak in. df is particularly sensitive to bad rtl, so that was my first suggestion. I had forgotten that df (as well as a lot of the infrastructure) does not work very late.

On 02/25/2015 01:54 PM, Georg-Johann Lay wrote:

Am 02/24/2015 um 07:33 PM schrieb Kenneth Zadeck:

When I suggested that you do a build with all of the checking turned on, I wanted you to do this without your new code in it. There is a good possibility that the problem is that your port is generating bad rtl.

Ah, ok. This actually revealed an ICE with checking enabled, but that was not related to the df problems... I think the problem was that I tried to run passes needing df after free_cfg.

My current solution works so far: it fixes the test case (correct code + insn lengths) and it has not ICEd up to now. Stuff is running a bit slowly with all checks on, so building and running tests will take a while...

FYI, you find my current patch attached.

Thanks for your help and time,

Johann

Index: config/avr/avr.c
===
--- config/avr/avr.c	(revision 220963)
+++ config/avr/avr.c	(working copy)
@@ -51,6 +51,8 @@
 #include "target-def.h"
 #include "params.h"
 #include "df.h"
+#include "context.h"
+#include "tree-pass.h"
 
 /* Maximal allowed offset for an address in the LD command */
 #define MAX_LD_OFFSET(MODE) (64 - (signed)GET_MODE_SIZE (MODE))
@@ -285,6 +287,58 @@ avr_to_int_mode (rtx x)
 }
 
+static const pass_data avr_pass_data_recompute_notes =
+{
+  RTL_PASS,      /* type */
+  "avr-xx",      /* name (will be patched) */
+  OPTGROUP_NONE, /* optinfo_flags */
+  false,         /* has_gate */
+  true,          /* has_execute */
+  TV_DF_SCAN,    /* tv_id */
+  0,             /* properties_required */
+  0,             /* properties_provided */
+  0,             /* properties_destroyed */
+  0,             /* todo_flags_start */
+  (TODO_df_finish | TODO_verify_rtl_sharing
+   | TODO_verify_flow),  /* todo_flags_finish */
+};
+
+
+class avr_pass_recompute_notes : public rtl_opt_pass
+{
+public:
+  avr_pass_recompute_notes (gcc::context *ctxt, const char *pass_name)
+    : rtl_opt_pass (avr_pass_data_recompute_notes, ctxt)
+  {
+    name = pass_name;
+  }
+
+  unsigned int execute ()
+  {
+    df_note_add_problem ();
+    df_analyze ();
+
+    return 0;
+  }
+}; // avr_pass_recompute_notes
+
+
+static void
+avr_register_passes (void)
+{
+  /* This avr-specific pass (re)computes insn notes, in particular REG_DEAD
+     notes which are used by `avr.c::reg_unused_after' and branch offset
+     computations.  These notes must be correct, i.e. there must be no
+     dangling REG_DEAD notes; otherwise wrong code might result, cf. PR64331.
+
+     DF needs (correct) CFG, hence right before free_cfg is the last
+     opportunity to rectify notes.  */
+
+  register_pass (new avr_pass_recompute_notes (g, "avr-notes-free-cfg"),
+                 PASS_POS_INSERT_BEFORE, "*free_cfg", 1);
+}
+
 
 /* Implement `TARGET_OPTION_OVERRIDE'.  */
 
 static void
@@ -346,6 +400,11 @@ avr_option_override (void)
   init_machine_status = avr_init_machine_status;
 
   avr_log_set_avr_log ();
+
+  /* Register some avr-specific pass(es).  There is no canonical place for
+     pass registration.  This function is convenient.  */
+
+  avr_register_passes ();
 }
 
 /* Function to set up the backend function structure.  */
Re: inconsistencies in the documentation regarding side effects with auto inc-dec
I will fix it.

kenny

On 03/17/2010 07:28 PM, Ramana Radhakrishnan wrote:
> Hi Kenneth,
>
> The documentation of auto-inc-dec still refers to flow, and when I
> raised this on IRC folks suggested that you might have some
> documentation fixes, if any, in this area.
>
> http://gcc.gnu.org/onlinedocs/gccint/Incdec.html#Incdec
>
> The lines in the doco are as below:
>
> These embedded side effect expressions must be used with care.
> Instruction patterns may not use them. Until the `flow' pass of the
> compiler, they may occur only to represent pushes onto the stack. The
> `flow' pass finds cases where registers are incremented or decremented
> in one instruction and used as an address shortly before or after;
> these cases are then transformed to use pre- or post-increment or
> -decrement.
>
> Cheers
> Ramana
Re: [lto][RFC] Do not emit hybrid object files
Andrew Pinski wrote:
> On Fri, Oct 17, 2008 at 2:15 PM, Diego Novillo <[EMAIL PROTECTED]> wrote:
>> On Fri, Oct 17, 2008 at 16:51, Richard Henderson <[EMAIL PROTECTED]> wrote:
>>> If the version of GCC being used isn't compatible with the version of the IL
>>> in the object file, we can just fall back on the final code.
>>
>> Fair enough. But this could be provided via a flag to optionally emit
>> final code. The more common scenario is likely to be hermetic builds
>> that use the same compiler throughout. Duplicating code generation
>> for *every* build seems too wasteful to make it the compiler's default
>> behaviour.
>
> The idea was that people did not have to change their makefiles. If
> some static library provided by vendor 1 does not include the real
> machine code, then the build just fails. So having the machine code
> by default still seems like the correct way to go forward, with a
> flag provided to turn it off if really needed (though I think it is
> bad to have two different modes in this case).
>
> Thanks,
> Andrew Pinski

Sorry to have missed most of this thread today; I was at a meeting in the city all day.

Andrew is correct that the reason for putting both lto and final code in the same file was to do the least damage to people's build tools. A change where each invocation of gcc produces two files instead of one would severely break many people's build environments. Mark and I were very concerned about how widely used LTO would be if trying it cost more than adding -flto in several places.

As far as the support of non-elf compilers goes, the idea was to minimize and modularize the actual dependence on elf as much as possible and then to let anyone who is still non-elf fill in the pieces. At the present time, all of the uses of elf go thru indirect calls, so in theory this will be easy. However, in practice this is likely to be harder. For instance, the current lto implementation assumes that there can be a lot of different sections, but the aix format does not allow this.
Re: [lto][RFC] Do not emit hybrid object files
Diego Novillo wrote:
> On Fri, Oct 17, 2008 at 20:52, Kenneth Zadeck <[EMAIL PROTECTED]> wrote:
>> Andrew is correct that the reason for putting both lto and final code in
>> the same file was to do the least damage to people's build tools. A
>> change where each invocation of gcc produces two files instead of one
>> would severely break many people's build environments. Mark and I were
>> very concerned about how widely used LTO would be if trying it cost more
>> than adding -flto in several places.
>
> OK, we can then offer the option of emitting hybrid objects for
> libraries that want to provide the duality. For the final user, the
> distinction will be transparent. However, for the most common use
> scenario, emitting hybrid files is really wasteful.

I actually think that the hybrid files should be the default. If you are willing to make invasive changes to your build environment to support two files, then you should be willing to add extra options to support that.
bitwise dataflow analysis
Silvius,

If you want to pursue this, you should go back and look at my patches for subreg-level dataflow and start hacking from there. It is not a big step to start from that and modify it for bits. If you start from that, the analysis part is most likely not much more than a few days of work. However, hacking combine could be a life-long effort!

All of this code got removed because its only client was the old ra.

Kenny
naked zero_extracts longer than a word.
Would those that know (or even those that are just generally vocal) be willing to support a change to rtl.texi for sign_extract (and by implication, zero_extract) from

  If @var{loc} is in memory, its mode must be a single-byte integer mode.
  If @var{loc} is in a register, the mode to use is specified by the
  operand of the @code{insv} or @code{extv} pattern (@pxref{Standard
  Names}) and is usually a full-word integer mode, which is the default
  if none is specified.

to a version that explicitly prohibits the use of a mode longer than a word? The back ends do not appear to support this anyway.

Kenny
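To make the proposal concrete, here is an illustrative pair of extractions (hand-written RTL, not taken from any port). The first is the usual word-mode form the current wording describes; the second, with the extraction in a mode wider than a word on a 32-bit target, is the kind the revised wording would forbid:

  ;; Allowed: extract 8 bits starting at bit 3, in word_mode
  ;; (SImode on a 32-bit target).
  (sign_extract:SI (reg:SI 100) (const_int 8) (const_int 3))

  ;; Prohibited under the proposed wording: a DImode extraction,
  ;; wider than a word on the same target.
  (sign_extract:DI (reg:DI 101) (const_int 8) (const_int 3))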
Re: Fwd: Register Pressure in Instruction Level Parallelism
David Edelsohn wrote:
> -- Forwarded message --
> From: Albert Cohen
> Date: Sun, Jun 28, 2009 at 6:25 PM
> Subject: Re: Register Pressure in Instruction Level Parallelism
> To: Dave Korn
> Cc: re...@meinersbur.de, gcc@gcc.gnu.org, Sid Touati, Frederic Brault
>
> Hi all,
>
> I'm convinced that saturation and sufficiency based approaches are the
> future of register pressure management.
>
> [Please keep my colleague Sid Touati in CC of this thread, until he
> registers to the list.]
>
> Unfortunately, the state of the art (more recent than the thesis
> referenced in the original email, see Touati's web page) is limited to
> basic block and software-pipelining scopes, and limited to scheduling.
>
> Compared to the tasks currently managed by reload, it certainly misses
> a whole bunch of instruction selection and immediate operand/address
> mode corner case problems (depending on the target). It also misses
> global scheduling, but extended BB scheduling is not very hard to
> approximate, as well as superblock scheduling.
>
> I'm not at all a knowledgeable person to tell you what to do in the
> case of GCC, but for sure saturation/sufficiency-based approaches are
> not sufficient to kill the dragon.
>
> However, I will strongly advise anybody (= Kenny Zadeck) looking at a
> fresh SSA-based backend design to consider such an approach to avoid
> messing up with pressure-aware-* where * is any backend pass that
> bears a risk of blowing up register pressure.

I am considering such a design, and late last week I decided to go public with my project. I am now working to put up a project in the public spaces. I would very much like anyone who is interested in working in this area to consider working on it in gcc.

kenny

> If you are interested in collaboration on the topic, we are about to
> start extending those approaches to general control-flow, and
> implementing it in a production compiler (not GCC, but a free one, and
> not LLVM either).
>
> Albert
>
> Dave Korn wrote:
>> Michael Kruse wrote:
>>> So, now my questions: How much do you think could this improve
>>> compiled code speed? Would the current GCC/YARA benefit from such an
>>> optimization pass at all? What are the chances that this could get
>>> into the main GCC tree if it shows up to be an improvement?
>>
>> One of the major problems in gcc is the intertangling of instruction
>> selection with register allocation and spill generation. If these
>> could be separated it would almost certainly generate better code and
>> be welcomed with open arms!
>>
>>> I'd prefer to implement this for the gcc, but my advisor wants me to
>>> do it for the university's own compiler. Therefore I could also need
>>> arguments why to do it for the GCC.
>>
>> Because destroying reload(*) would be an incalculable public service
>> and your name will be remembered in history as the one who slew the
>> dragon? ;-)
>>
>> cheers,
>> DaveK
Re: Mainline bootstrap failure (revision 110017)
Daniel Berlin wrote:

On Fri, 2006-01-20 at 10:53 +0530, Ranjit Mathew wrote:

Hi,

Mainline fails to bootstrap for me (revision 110017) on i686-pc-linux-gnu. Configured as:

$GCC_SRC_DIR/configure --prefix=$HOME/gcc --enable-languages=c,c++,java \
  --with-as=/home/ranmath/gnu/bin/as --with-gnu-as \
  --with-ld=/home/ranmath/gnu/bin/ld --with-gnu-ld \
  --with-arch=pentium4 --with-tune=pentium4 \
  --disable-nls --disable-checking --disable-libmudflap \
  --disable-debug --enable-threads=posix --enable-__cxa_atexit \
  --disable-static

Kenny thought it would be nice, rather than pass the actual bb info to be freed to the freeing function, to instead pass some random bitmap. The attached fixes *that*, but this just causes a crash deeper in, trying to free some chains. However, it looks like that is either caused by a double free, or because we never null out pointers to things after we free the memory for what they are pointing to.

Index: df-core.c
===
--- df-core.c	(revision 110017)
+++ df-core.c	(working copy)
@@ -292,6 +292,7 @@ are write-only operations.
 
 static struct df *ddf = NULL;
 struct df *shared_df = NULL;
+static void * df_get_bb_info (struct dataflow *, unsigned int);
 
 /* Functions to create, destroy and manipulate an instance of df.  */
@@ -370,7 +371,7 @@ df_set_blocks (struct df *df, bitmap blo
       EXECUTE_IF_SET_IN_BITMAP (diff, 0, bb_index, bi)
 	{
 	  basic_block bb = BASIC_BLOCK (bb_index);
-	  (*dflow->problem->free_bb_fun) (dflow, bb, diff);
+	  (*dflow->problem->free_bb_fun) (dflow, bb, df_get_bb_info (dflow, bb_index));
 	}
     }
 }

This is why C sucks. In any language with a reasonable type system, this could have been written with a subtype and would not require using a void * arg.

Sorry for the problems and thanks for looking into them.

kenny
Re: Mainline bootstrap failure (revision 110017)
Andreas Jaeger wrote:

Kenneth Zadeck <[EMAIL PROTECTED]> writes:

Daniel Berlin wrote:

The attached fixes *that*, but this just causes a crash deeper in, trying to free some chains. [...]

Sorry for the problems and thanks for looking into them.

Ken, please reread the email. The issue is *not* fixed according to Daniel, there's still another problem. Could you look into it, please?

Andreas

I will update and start looking. Sorry.

kenny
Re: Mainline bootstrap failure (revision 110017)
Andreas Jaeger wrote:

Kenneth Zadeck <[EMAIL PROTECTED]> writes:

Daniel Berlin wrote:

The attached fixes *that*, but this just causes a crash deeper in, trying to free some chains. [...]

Sorry for the problems and thanks for looking into them.

Ken, please reread the email. The issue is *not* fixed according to Daniel, there's still another problem. Could you look into it, please?

Andreas

I would like permission to revert Zdenek's patch for a few days. There is nothing wrong with Zdenek's patch, per se, but it exercises a part of df that should work but does not.

We could revert my storage patch, but the problem is much deeper than that. The storage patch only causes this problem to happen when bootstrapping. The problem will still be there and may cause random ices when running Zdenek's code.

The problem is that whenever you delete any basic blocks, you need to get rid of the def-use and use-def chains that either start or end in the deleted block; furthermore, this has to be done before the instructions are deleted for those blocks. This will take me a day to get correct. Zdenek's patch is the only code that manipulates the blocks after use-def or def-use chains are built.

Kenny
Re: Mainline bootstrap failure (revision 110017)
Zdenek Dvorak wrote:

Hello,

The attached fixes *that*, but this just causes a crash deeper in, trying to free some chains. [...]

Sorry for the problems and thanks for looking into them.

Ken, please reread the email. The issue is *not* fixed according to Daniel, there's still another problem. Could you look into it, please?

I would like permission to revert Zdenek's patch for a few days. There is nothing wrong with Zdenek's patch, per se, but it exercises a part of df that should work but does not.

I don't quite like this idea; as you yourself say, the problem is not in that patch (in fact, mainline bootstraps and passes regtesting with it without your patch); so I think that if we indeed want to go into reverting patches, it should be the one causing the problem (and I have a followup patch that I would like to be able to test).

Btw.: of course it may happen that some patch sometimes breaks bootstrap; it has happened to every one of us. But, with your patch, not even libgcc builds. This means that you did not even try to build gcc before committing your patch. Please do that next time. In fact, running at least a partial bootstrap till gengtype is run has helped me avoid introducing bootstrap failures (caused by unexpected interaction with patches committed since I tested the patch fully the last time) several times, so it is a good idea even if you are quite sure that no such problem may occur.

Zdenek,

There are several problems here. If I fix the immediate problem, you only hit the next one. The underlying problem is that when you change the blocks in the program and you have def-use or use-def chains built, these have to be cleaned out in a way that was not being done. If I revert my patch, it will fix the immediate problem that df_set_blocks crashes, but the same crash will most likely happen in other places. It is because of the other places that it is not sufficient to just revert my patch. Yours is currently the only place that plays with blocks when the use-def chain problem has been built. This is something that I advertised could be done, and I blew it. But removing my last patch is not going to fix that; it will still be broken, it will just fail less often.

Kenny

We could revert my storage patch, but the problem is much deeper than that. The storage patch only causes this problem to happen when bootstrapping. The problem will still be there and may cause random ices when running Zdenek's code. The problem is that whenever you delete any basic blocks, you need to get rid of the def-use and use-def chains that either start or end in the deleted block; furthermore, this has to be done before the instructions are deleted for those blocks. This will take me a day to get correct. Zdenek's patch is the only code that manipulates the blocks after use-def or def-use chains are built.

This analysis seems wrong to me -- the crash occurs when called from estimate_probability, and we do not change the CFG in branch prediction.

Zdenek
Re: Mainline bootstrap failure (revision 110017)
Eric Botcazou wrote:

So he updated his tree, saw changes in the middle-end and committed his without testing.

So Kenny would have had to launch a new bootstrap, wait for a couple of hours only to discover that something again changed in-between, and so on?

This is exactly what I did: when I got the approval, I did an update, launched a bootstrap and regression test; when that passed, I did another update and committed. I am sure that Zdenek has access to a 1000-teraflop machine to make this process almost instantaneous. The rest of us have to make do with machines that only run at a few gigahertz.

kenny
Re: Mainline bootstrap failure (revision 110017)
This fixes the problems that became apparent from Zdenek's patch. Bootstrapped and regression tested against the version with Zdenek's original code (since this directly tickled the failure), and bootstrapped (and in the process of regression testing) against the current mainline. Both on i686-pc-linux-gnu.

Kenny

2006-01-20  Kenneth Zadeck  <[EMAIL PROTECTED]>

	* df-scan.c (problem_SCAN): Added NULL reset function.
	(df_scan_reset_blocks): Added code to call reset block function.
	(df_bb_refs_delete): Fixed comment.
	(df_insn_refs_delete): Made tolerant of deleting non existent info
	for dataflow problems that need to be reset.
	* df-core.c (df_set_blocks): Ditto.
	* df.h (struct df_problem): Added reset_fun.
	* df-problems.c (problem_RU, problem_RD, problem_LR, problem_UR,
	problem_UREC, problem_CHAIN, problem_RI): Initialized reset_fun field.
	(df_chain_insn_reset, df_chain_bb_reset, df_chain_reset): New
	functions to clear out all references to def-use or use-def chains.

Zdenek Dvorak wrote:

Hello,

The attached fixes *that*, but this just causes a crash deeper in, trying to free some chains. [...]

Sorry for the problems and thanks for looking into them.

Ken, please reread the email. The issue is *not* fixed according to Daniel, there's still another problem. Could you look into it, please?

I would like permission to revert Zdenek's patch for a few days. There is nothing wrong with Zdenek's patch, per se, but it exercises a part of df that should work but does not.

I don't quite like this idea; as you yourself say, the problem is not in that patch (in fact, mainline bootstraps and passes regtesting with it without your patch); so I think that if we indeed want to go into reverting patches, it should be the one causing the problem (and I have a followup patch that I would like to be able to test).

Btw.: of course it may happen that some patch sometimes breaks bootstrap; it has happened to every one of us. But, with your patch, not even libgcc builds. This means that you did not even try to build gcc before committing your patch. Please do that next time. In fact, running at least a partial bootstrap till gengtype is run has helped me avoid introducing bootstrap failures (caused by unexpected interaction with patches committed since I tested the patch fully the last time) several times, so it is a good idea even if you are quite sure that no such problem may occur.

We could revert my storage patch, but the problem is much deeper than that. The storage patch only causes this problem to happen when bootstrapping. The problem will still be there and may cause random ices when running Zdenek's code. The problem is that whenever you delete any basic blocks, you need to get rid of the def-use and use-def chains that either start or end in the deleted block; furthermore, this has to be done before the instructions are deleted for those blocks. This will take me a day to get correct. Zdenek's patch is the only code that manipulates the blocks after use-def or def-use chains are built.

This analysis seems wrong to me -- the crash occurs when called from estimate_probability, and we do not change the CFG in branch prediction.

Zdenek

Index: df-scan.c
===
--- df-scan.c	(revision 110059)
+++ df-scan.c	(working copy)
@@ -304,6 +304,7 @@ static struct df_problem problem_SCAN =
   DF_SCAN,               /* Problem id.  */
   DF_NONE,               /* Direction.  */
   df_scan_alloc,         /* Allocate the problem specific data.  */
+  NULL,                  /* Reset global information.  */
   df_scan_free_bb_info,  /* Free basic block info.  */
   NULL,                  /* Local compute function.  */
   NULL,                  /* Init the solution specific data.  */
@@ -426,6 +427,8 @@ df_rescan_blocks (struct df *df, bitmap
   if (blocks)
     {
+      int i;
+
       /* Need to assure that there are space in all of the tables.  */
       unsigned int insn_num = get_max_uid () + 1;
       insn_num += insn_num / 4;
@@ -443,6 +446,24 @@ df_rescan_blocks (struct df *df, bitmap
       df->def_info.add_refs_inline = true;
       df->use_info.add_refs_inline = true;
 
+      for (i = df->num_problems_defined; i; i--)
+	{
+	  bitmap blocks_to_reset = NULL;
+	  if (*dflow->problem->reset_fun)
+	    {
+	      if (!blocks_to_reset)
+		{
+		  blocks_to_reset = BITMAP_ALLOC (NULL);
+		  bitmap_copy (blocks_to_reset, local_blocks_to_scan);
+		  if (df->blocks_to_scan)
+		    bitmap_ior_into (blocks_to_reset, df->blocks_to_scan);
+		}
+	      (*dflow->problem->reset_fun) (dflow, blocks_to_reset);
+	    }
+	  if (blocks_to_reset)
+	    BITMAP_FREE (blocks_to_reset);
+	}
+
       df_refs_delete (dflow, local_blocks_to_scan);
 
       /* This may be a mis
Re: Bootstrap failure on Linux/x86-64
Graham Stott wrote:

Andreas,

FWIW I've had a successful bootstrap with these checking flags on x86_64-unknown-linux-gnu.

Graham

My bootstrap was also fine, using x86_64-suse-linux-gnu.

Kenny
Re: Bootstrap failure on Linux/x86-64
Andreas Jaeger wrote:

Andrew Pinski <[EMAIL PROTECTED]> writes:

On Jan 21, 2006, at 1:09 PM, Andreas Jaeger wrote:

On Sat, Jan 21, 2006 at 12:42:24PM -0500, Andrew Pinski wrote:

On Jan 21, 2006, at 12:38 PM, Andreas Jaeger wrote:

I'm still seeing this with current Subversion. Anybody else having the same problem on x86-64? It fails for me on the two systems I tested it on,

What revision is this on? This worked for me on 110050 and 110059. I'll try 110059 now to double check.

Perhaps my configure flags are the problem, I use:

/cvs/gcc-svn/trunk/configure --prefix=/opt/gcc/4.2-devel --enable-checking=misc,tree,gc,rtl,rtlflag,assert --enable-threads=posix --enable-clocale=gnu --enable-__cxa_atexit --enable-shared --with-system-zlib x86_64-suse-linux-gnu --enable-languages=c,ada,c++,fortran,java,objc,treelang

I just used 110067 and it worked. Maybe it is one of the checking options you are using which is causing it. I just used the default checking.

I now used 110067 with the default checking options - and it worked, so it's the checking options. Kenneth, Zdenek, could you look into this? Please use the same checking options that I have above.

Let me double check with the checking you have.

Thanks,
Andreas

I am not going to see this failure on my machine. I have a 2GB x86-64 which will easily handle compiling a file of this size. I was watching top as it compiled a stage2 insn-attrtab, which I think contains the largest function in the gcc build, and the VIRT topped out at 501MB, which is no sweat for this machine.

You could argue (and I would not disagree with you) that 501MB is not tolerable and is symptomatic of something bad happening. There are most likely storage leaks in the system. But I do not think that you are going to necessarily find the problem by looking to see which version happens to crash on your machine. The worst case scenario is that this is just footprint creep, and the version that failed was just the one that added the straw that broke the camel's back with your particular config.

I am not fully through building stage3 yet and have no failure so far. By the way, I am doing this with the latest baseline compiler. I can go back to earlier compilers if you want. But given that Andreas's build failed with the following line:

cc1: out of memory allocating 22752576 bytes after a total of 142843904 bytes

I am not going to see the problem.

Kenny
Re: Mainline bootstrap failure (revision 110017)
Zdenek Dvorak wrote:

Hello,

On Fri, 20 Jan 2006, Zdenek Dvorak wrote:

I propose the following workaround instead, that also restores bootstrap. It changes the way loop-iv uses df to a more conservative one, that does not seem to cause problems.

That's what I like to see... options. Yes, this is OK for mainline, please hold off on the reversion (or if necessary reapply with this change).

OK to revert this workaround now? Mainline now passes bootstrap & regtesting on i686 without it.

I tested my code before and after your revert and it passed. As far as I am concerned, put it back.

Zdenek

	* loop-iv.c (iv_analysis_loop_init): Use df analysis in a more
	efficient way.

Index: loop-iv.c
===
*** loop-iv.c	(revision 110143)
--- loop-iv.c	(working copy)
*** iv_analysis_loop_init (struct loop *loop
*** 250,260 ****
    current_loop = loop;
  
    /* Clear the information from the analysis of the previous loop.  */
!   if (!first_time)
!     iv_analysis_done ();
!   df = df_init (DF_HARD_REGS | DF_EQUIV_NOTES);
!   df_chain_add_problem (df, DF_UD_CHAIN);
!   bivs = htab_create (10, biv_hash, biv_eq, free);
  
    for (i = 0; i < loop->num_nodes; i++)
      {
--- 250,263 ----
    current_loop = loop;
  
    /* Clear the information from the analysis of the previous loop.  */
!   if (first_time)
!     {
!       df = df_init (DF_HARD_REGS | DF_EQUIV_NOTES);
!       df_chain_add_problem (df, DF_UD_CHAIN);
!       bivs = htab_create (10, biv_hash, biv_eq, free);
!     }
!   else
!     clear_iv_info ();
  
    for (i = 0; i < loop->num_nodes; i++)
      {
very confused about single_set
Steven, Zdenek,

1) single_set is built on top of single_set_2.
2) single_set_2 uses REG_UNUSED notes.
3) tree-ssa-loop-ivopts.c:seq_cost uses single_set.

I can find no indication that rtl dfa is run here to provide the information for single_set to produce the correct answer.

Inquiring minds want to know.

Kenny
Re: very confused about single_set
Daniel Berlin wrote:
> On Sun, 2006-02-19 at 09:56 -0500, Kenneth Zadeck wrote:
>> Steven, Zdenek,
>>
>> 1) single_set is built on top of single_set_2.
> Yes
>
>> 2) single_set_2 uses REG_UNUSED notes.
> Only if there are multiple sets.
>
>> 3) tree-ssa-loop-ivopts.c:seq_cost uses single_set.
> This is because Zdenek builds RTL to determine target costs.
>
>> I can find no indication that rtl dfa is run here to provide the
>> information for single_set to produce the correct answer.
>
> I don't understand what you mean.
> Why does the DFA need to be run to produce the correct answer?
> (Our ports use Deterministic Finite Automaton based schedulers, so
> saying DFA is an ambiguous term at best)
>
> Or do you mean DataFlow Analysis, in which case I'm sure everyone
> expects the notes are always up to date (even though they may not be).

I was referring to the REG_UNUSED notes not being there at all when this is called. You are correct that this is not so much a correctness issue as a place where the code is pessimistic.

kenny
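To see why the notes matter, consider how seq_cost-style code leans on single_set. This is a paraphrase of the pattern, not the exact ivopts source; the two-argument rtx_cost is the interface of this era:

  /* For a single-set insn, cost the SET_SRC.  For a multi-set
     PARALLEL, single_set_2 can only discard the extra sets whose
     destinations carry REG_UNUSED notes -- with no notes present it
     must return NULL, and the caller falls back to a guess.  */
  rtx set = single_set (insn);
  if (set)
    cost += rtx_cost (SET_SRC (set), SET);
  else
    cost++;	/* conservative (pessimistic) fallback */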
Re: Reload producing lots of duplicate insns
Steven Bosscher wrote:
> Hi,
>
> While teaching dce.c on the dataflow-branch to delete noop-moves,
> I noticed that most noop moves are produced by reload. It inserts
> duplicate insns for some reloads, postreload turns the duplicate
> reload into a simple reg-reg move (but the lhs and rhs are the same
> register of course), and then dce.c removes these moves because
> they are noop-moves.

Steven,

Did Richard's untested patch not fix this?

Kenny
Re: Deprecating -f options?
Steven Bosscher wrote:
> On 6/15/06, Joern RENNECKE <[EMAIL PROTECTED]> wrote:
>> In http://gcc.gnu.org/ml/gcc/2006-06/msg00549.html, you wrote:
>>> -ftree-live-range-split splits live ranges on exit from ssa. This is
>>> on by default, and in fact it is more work to NOT split unrelated live
>>> ranges, and creates extra register pressure.
>>
>> I've found that in 4.1, -fno-web is often needed in order to generate
>> auto-increment addressing.
>> I suppose that will be similar for any live range splitting that
>> happens before flow.
>
> Well, is that something you have to blame live range splitting for, or
> is perhaps flow just too stupid to see through the split live ranges?
>
> Actually, Zadeck is working on a new auto-increment pass anyway IIUC.
>
> Gr.
> Steven

My pass takes the auto-inc code out of flow and reimplements it on its own, and it handles pre-modify and post-modify properly. It does not attempt to do what Joern is doing, i.e. changing a reference from one base register to another.

Kenny
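For reference, the side-effect addresses in question look like this in RTL (standard forms documented in rtl.texi; the register numbers are made up):

  ;; pre-increment, the push-like case
  (mem:SI (pre_inc:SI (reg:SI 10)))

  ;; post-modify: use r1 as the address, then add 8 to it
  (mem:SI (post_modify:SI (reg:SI 1)
                          (plus:SI (reg:SI 1) (const_int 8))))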
Re: local data flow
Joern RENNECKE wrote:
> I've been looking at the problem of converting the struct-equiv code
> to use DEF-USE chains instead of global dataflow information, but I
> have hit a snag. We can find local registers as being registers that
> are defined somewhere in the examined (partial) block, and have all
> their uses within the (partial) block. However, there is no feasible
> way to find dying inputs with def-use / use-def chains. We could find
> the definition with use-def chains, and examine all the uses of this
> definition with def-use chains, but we can't easily tell if a use
> could be reached through the examined (partial) block.
>
> And this is really not an isolated problem. Generally, when we
> examine REG_DEAD notes, there is no equivalent information to be
> found in def-use chains. The problem is that we are not really
> interested in what uses can be reached from the definition, but which
> can be reached from the use we are examining. I.e. we'd like a list
> of uses of the same register after the current use. In fact, we don't
> need an exhaustive list. It is sufficient if we have a set of uses
> such that any use which is reachable from the current use is
> dominated by one of the uses in the set (a minimal set is desirable,
> but not necessary). With these sets in place, we could also restrict
> the def-use information to such a set. When we add back pointers to
> each link, these data structures should be reasonably easy to keep
> sufficiently up-to-date to remain usable, i.e. such that every
> dependency is represented, and that there are no dangling pointers.

You can have def-use chains, you can have use-def chains, or you can have both. It seems like what you are asking for is use-def chains, i.e. the list of defs that can reach each use. If the set is empty, this is equivalent to a reg-dead note.

kenny
Re: local data flow
Joern RENNECKE wrote:
> Kenneth Zadeck wrote:
>> You can have def-use chains, you can have use-def chains, or you can
>> have both. It seems like what you are asking for is use-def chains,
>
> No, I want to know if there exists a path from the current *use* of a
> register to some other *use* of the same register without going
> through a def.

The right way to do this is not to build chains but to define your own dataflow problem. I think that what you want is something like the reaching uses problem, but a forwards version of it rather than the backwards version that is defined in df-problems.c. However, you may find that that problem does provide the info you are looking for. If this is what you want, I can help you define it.

>> i.e. the list of defs that can reach each use. If the set is empty,
>> this is equivalent to a reg-dead note.
>
> Huh? If the set is empty, the use is either uninitialized or
> unreachable.

Yes, I am sorry.
Re: local data flow
Joern RENNECKE wrote:
> Kenneth Zadeck wrote:
>> The right way to do this is not to build chains but to define your
>> own dataflow problem.
>
> But wouldn't I need to update the problem solution every time I change
> a bit of the program - which would be much more costly than doing a
> local update of some local def-firstuse or use-nextuse chains?

Depending on what you are doing, you can update the solution in place. The point of the dataflow talk was not to say that you cannot do anything incremental; it was to say that there are no good GENERAL techniques. Many times it is possible to update the solution precisely if you have a very good understanding of the local conditions under which the transformation is done.

The other trick is to do what I do in the new fast dce or the if-cvt on the dataflow branch: order the basic blocks so that it does not matter if you mess up the solution. Generally a postorder or a reverse postorder traversal of the basic blocks will have the property that you can just mess things up in a wave that moves from the beginning to the end, or from the end to the beginning, of the program without ever seeing the mess you make. The only time you need to iterate is if you mess things up at the top of a block that is the destination of a back edge. This is a very useful trick in a compiler. The cfg is your friend.

>> I think that what you want is something like the reaching uses
>> problem, but a forwards version of it rather than the backwards
>> version that is defined in df-problems.c.
>
> It is reaching uses, but the starting point is not necessarily a
> definition, but is more often a use. I want to know about uses that
> are forward of the current site in the control flow, but I suppose
> this is best computed with a backward propagation of liveness data.
> AFAICT that's the same direction that the current reaching use problem
> has.

The gen set of the reaching uses problem is the set of uses. The kill set is the defs and the clobbers. This is why the problem is called "reaching uses".
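Spelled out as the usual equations, this is the standard backward formulation matching the description above (a paraphrase, not a quote from df-problems.c):

  /* For each basic block bb:

       gen(bb)  = uses in bb (walking backwards, not killed earlier)
       kill(bb) = defs and clobbers in bb

     solved backwards over the cfg:

       out(bb) = union of in(s) for every successor s of bb
       in(bb)  = gen(bb) | (out(bb) & ~kill(bb))  */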
Re: local data flow
Joern Rennecke wrote:
> In http://gcc.gnu.org/ml/gcc/2006-07/msg00390.html, you write:
>> depending on what you are doing, you can update the solution in place.
>> The point of the dataflow talk was not to say that you cannot do
>> anything incremental; it was to say that there are no good GENERAL
>> techniques. Many times it is possible to update the solution precisely
>> if you have a very good understanding of the local conditions under
>> which the transformation is done.
>
> Right. So the struct-equiv code does not have to be converted to
> use def-use chains to be better integrated with the dataflow branch;
> it can happily go on using regsets of live registers.
> What is needed is a solid understanding of what invariants are
> required and how we can maintain them while we are changing the rtl,
> in order to keep the cost of updating the information under control.
> Which, ironically, is what I proposed to use in the first place outside
> of the dataflow branch to address the compile time problems, but Bernd
> considered that approach too fragile.

It is fragile, but not in the normal sense. It cannot be generalized to work with any general transformation. You will end up with something that only works for the kinds of transformations that you are doing in the pass you are writing. This is fine. I should point out that you will also not be able to cure cancer with this. You just have to be satisfied to do a good job solving the problem at hand.

>> The other trick is to do what I do in the new fast dce or the if-cvt
>> on the dataflow branch: order the basic blocks so that it does not
>> matter if you mess up the solution. Generally a postorder or a
>> reverse postorder traversal of the basic blocks will have the
>> property that you can just mess things up in a wave that moves from
>> the beginning to the end, or from the end to the beginning, of the
>> program without ever seeing the mess you make.
>
> Cross-jumping two entire basic blocks creates opportunities for further
> cross-jumping and jump simplification involving the predecessor blocks.
> This appears to fit a reverse postorder traversal.
>
> However, if-conversion creates further if-conversion opportunities
> involving the newly merged block. Thus, while a postorder / reverse
> postorder traversal makes sense, we also need valid information for
> the newly merged block, its successors and predecessors.
> I suppose reg-live-at-start / reg-live-at-end information is actually
> easier to maintain during if-conversion than def-use chains.

This is true, certainly in theory, a lot less so in practice. The way that you order things is this (see the skeleton after this mail):

  while (something changes and we have not done too many passes)
    do the analysis;
    do one properly ordered pass over the cfg, changing what you can
    change without using information that has been corrupted during
    this pass;

You should limit the number of passes to some small number times the optimization level. You only iterate if you made changes during that pass. You try to keep the dataflow up to date; if some transformation is too hard to do this for, you mark the block as not transformable until the next pass. In practice, there are few second order effects, so you end up iterating twice.

>>> I think that what you want is something like the reaching uses
>>> problem, but a forwards version of it rather than the backwards
>>> version that is defined in df-problems.c.
> ...
>> The gen set of the reaching uses problem is the set of uses. The kill
>> set is the defs and the clobbers. This is why the problem is called
>> "reaching uses".
>
> In my problem, the gen set is the set of uses, and the kill set is
> defs, clobbers and uses. Hence, it's still a backwards problem.

You could define it either way. However, the backwards problem is already there in both the mainline and the dataflow branch. It is called reaching uses, or DF_RU.

> But it is not really useful to compute this just for one or two code
> transformations - benefits could only be obtained if this information
> can be kept usable between different transformations to reduce the
> number of global updates.

You have to let it go. This is why we defined it the way we did. Just compute it, use it, and throw it away when you are finished. It is way too hard to figure out how to keep this up to date for all of the passes.

> I suppose I should instead continue to operate on regsets, but keep
> them usable without frequent global updates, if that's all right with
> Bernd.

I think you should look closely at DF_RU.
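As a concrete skeleton of that iteration scheme (illustrative only: try_transform, bb_info_corrupted_p and MAX_PASSES are made-up names; df_analyze and FOR_EACH_BB are the real interfaces, and a real pass would walk the blocks in postorder or reverse postorder as described above):

  static void
  iterate_transform_passes (void)
  {
    basic_block bb;
    int pass = 0;
    bool changed = true;

    while (changed && pass++ < MAX_PASSES)
      {
	changed = false;
	df_analyze ();	/* refresh the global solution */

	/* One properly ordered walk; any block whose info this pass
	   has corrupted is simply skipped until the next iteration.  */
	FOR_EACH_BB (bb)
	  if (!bb_info_corrupted_p (bb))
	    changed |= try_transform (bb);
      }
  }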
Re: local data flow
Joern RENNECKE wrote:
> Kenneth Zadeck wrote:
>>> I suppose reg-live-at-start / reg-live-at-end information is actually
>>> easier to maintain during if-conversion than def-use chains.
>> This is true, certainly in theory, a lot less so in practice.
>> The way that you order things is this:
>>   while (something changes and we have not done too many passes)
>>     do the analysis;
>>     do one properly ordered pass over the cfg, changing what you can
>>     change without using information that has been corrupted during
>>     this pass;
>
> Except that with the current def-use chains, updating def-use chains is
> pointless, since they don't have the required information in the first
> place.
>
>>> In my problem, the gen set is the set of uses, and the kill set is
>>> defs, clobbers and uses. Hence, it's still a backwards problem.
>> You could define it either way. However, the backwards problem is
>> already there in both the mainline and the dataflow branch. It is
>> called reaching uses, or DF_RU.
>
> When the block has one or more uses, but no def of a register, the
> question to ask is 'which uses of this register reach this *use* site'
> (backwards). That's different from DF_RU, which calculates this
> information for def sites instead.
>
>> I think you should look closely at DF_RU.
>
> What more is there to see? When the block I'm looking at only has uses
> of a register, I don't want to know about uses in 'siblings' in the
> control flow graph - what I want to know is if the register is live at
> the end of the block. LR should answer that, and I can keep this
> information up-to-date inasmuch as it will mention all live registers,
> and remain self-consistent.

Joern,

I do not know what problem you want to solve. You have never given me, or I think anyone else, a succinct description of exactly what you want to do here. What you did say that you needed was:

"No, I want to know if there exists a path from the current *use* of a register to some other *use* of the same register without going through a def."

From that description I assumed that you really did care which uses actually reached which other uses. The reaching uses problem tracks each use separately. If this isn't what you need, then you are free to use LR, which is certainly much cheaper than RU.
Re: local data flow
Joern RENNECKE wrote:
> Kenneth Zadeck wrote:
>> From that description I assumed that you really did care which uses
>> actually reached which other uses. The reaching uses problem tracks
>> each use separately. If this isn't what you need, then you are free
>> to use LR, which is certainly much cheaper than RU.
>
> Yes, LR is perfectly adequate. I was thinking in terms of def-use
> chains only because these can be kept more comprehensively up-to-date
> - i.e. when a register is live in a loop solely because it is used
> afterwards, and the use after the loop goes away, def-use chains won't
> keep a lingering liveness of the register in the loop.
> But since the existing def-use chains don't actually fit the problem,
> there is no point in trying to use them here.
>
> So, in summary, the plan is:
> - Use LR (Live Registers) information for if-conversion / cross-jumping.
> - Update the information after each transformation to keep a correct,
>   but not necessarily minimal, solution.
> - After traversing the entire flowgraph, if anything changed,
>   re-compute LR from scratch.
> - After re-computing LR, we need to traverse the flowgraph again.
>   This might mean we end up doing more if-conversion / cross-jumping
>   checks than with the current scheme, but since these checks are
>   local, this should be generally cheaper than trying to keep a
>   minimal LR solution all the time.
>
> I'm not sure how to best handle splitting blocks when doing
> cross-jumping. propagate_block can produce inconsistent liveness
> information for stack regs after reg_stack. Currently this is handled
> by zeroing out all the stack reg liveness information. Maybe we
> shouldn't even try to update LR for the split upper parts of the
> original BBs. The live-at-end information for the joined lower part
> should be basically the same as both original BBs had, and doing
> further optimizations with a partially or fully joined block appears
> not unlikely; however, the remaining unjoined upper parts of the
> original BBs are somehow different from each other, so any further
> cross-jumping for them, while not impossible, seems unlikely - AFAICT
> it would require a complete cross-jumping of the joined lower part
> with some other block. So maybe we could mark them as having no valid
> LR information, and not consider them again till the next global LR
> recomputation.

You should do this starting from the dataflow branch. The version of if-cvt there works as I have mentioned in the previous mail and does not use propagate_block at all.
Re: local data flow
Joern RENNECKE wrote:
> Kenneth Zadeck wrote:
>> You should do this starting from the dataflow branch. The version of
>> if-cvt there works as I have mentioned in the previous mail and does
>> not use propagate_block at all.
> if-conversion always joins blocks. But cross-jumping merges blocks or
> partial blocks. If the latter, both of the original blocks are split,
> and the bottom part of the blocks becomes one merged block. You need
> to re-compute the live-at-end information for the upper parts of the
> original blocks, which become two separate BBs.

Are you saying that this is a problem with my code or a problem with what you have to do?

Updating the LR dataflow when splitting a basic block is generally pretty easy. You start from the end of the block and just interpret the uses and defs for each insn. The code in the new version of if-cvt actually does this, where it replaces the old code that used to call propagate_block.

kenny
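The backward walk described here is easy to show in miniature. A self-contained toy (registers as bits in a mask; not GCC code, all names invented):

    #include <stdint.h>

    typedef struct { uint64_t use, def; } insn_info;

    /* Given the live-out set of a (possibly freshly split) block, scan
       its insns in reverse, killing defs and adding uses, to recover
       live-in and the liveness at each point in between.  */
    uint64_t
    live_in_from_live_out (const insn_info *insns, int n_insns,
                           uint64_t live_out)
    {
      uint64_t live = live_out;
      for (int i = n_insns - 1; i >= 0; i--)
        {
          live &= ~insns[i].def;  /* regs defined here die above this insn */
          live |= insns[i].use;   /* regs used here are live above it */
        }
      return live;
    }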
Re: local data flow
Joern RENNECKE wrote:
> Kenneth Zadeck wrote:
>> Updating the LR dataflow when splitting a basic block is generally
>> pretty easy. You start from the end of the block and just interpret
>> the uses and defs for each insn.
> This means duplicating this aspect of the propagate_block
> functionality. If we really have to do this in multiple places, we
> should have a function to do that.

I have functions for it, look at the code.

>> The code in the new version of if-cvt actually does this, where it
>> replaces the old code that used to call propagate_block.
> Even propagate_block couldn't handle stack regs properly. The problem
> is that arithmetic operations have implicit pushes / pops, which
> effectively rename your registers.
> You won't see this problem for passes that run before reg-stack.

Is it really necessary to do your pass after reg-stack? Given that there is no if-conversion that runs after reg-stack, what is your point? I should point out that there are no passes that currently use any dataflow after reg-stack.
Re: local data flow
Joern RENNECKE wrote:
> Kenneth Zadeck wrote:
>> Is it really necessary to do your pass after reg-stack? Given that
>> there is no if-conversion that runs after reg-stack, what is your
>> point?
> I am talking about cross-jumping after reg-stack.
>> I should point out that there are no passes that currently use any
>> dataflow after reg-stack.
> That's because the current cross-jumping is so stupid.

I would be very careful about treading here. reg-stack is a can of worms, and not very good worms. We have had a lot of trouble retrofitting better dataflow into reg-stack because of the fragile nature of its implementation.

If you think that your cross-jumping would be good to do as a pass interleaved with if-conversion, it is possible to see the benefit there. However, I think that the code gets hardened a lot by register allocation, and it is difficult for me to see how heroic transformations are going to be really useful after that pass.

Anyway, many compilers do label the stack and do dataflow on the individual stack elements. GCC is problematic because most of the stack offsets are set in register allocation, and in truth it really is too late to get much benefit after that pass. But it is not hard to define a problem that did label the stack elements.

Kenny
type consistency of gimple
Mark,

I have had some discussions with Honza and Diego about the type consistency at the gimple level. They told me that Andrew was in the process of making gimple properly type consistent.

I just wanted to point out how this affects encoding gimple into dwarf. If the gimple is type consistent, then it looks like the only place where I will need to write type references is at CONVERT_EXPRs and NOP_EXPRs. If it is not type consistent, Diego and Honza do not believe that I can properly get away without putting type references at every tree node.

This looks like about 20% of the size of writing a function body. I do not know how close Pinskia's project is to completion, but anything that you, or any one else, could do to help will pay off for LTO. It has been suggested that I assume that gimple is type consistent as a way to force the issue. I like this idea, but it is not something that I feel should be kept a secret either.

Kenny
Re: type consistency of gimple
Richard Guenther wrote:
> On 8/11/06, Kenneth Zadeck <[EMAIL PROTECTED]> wrote:
>> Mark,
>>
>> I have had some discussions with Honza and Diego about the type
>> consistency at the gimple level. They told me that Andrew was in the
>> process of making gimple properly type consistent.
>>
>> I just wanted to point out how this affects encoding gimple into
>> dwarf. If the gimple is type consistent, then it looks like the only
>> place where I will need to write type references is at CONVERT_EXPRs
>> and NOP_EXPRs. If it is not type consistent, Diego and Honza do not
>> believe that I can properly get away without putting type references
>> at every tree node.
>>
>> This looks like about 20% of the size of writing a function body. I
>> do not know how close Pinskia's project is to completion, but anything
>> that you, or any one else, could do to help will pay off for LTO. It
>> has been suggested that I assume that gimple is type consistent as a
>> way to force the issue. I like this idea, but it is not something
>> that I feel should be kept a secret either.
>
> Maybe you can elaborate how you are missing information (assuming the
> SSA temporaries still have a type in their "decl" node)? Especially
> fold relies on exact matching types for arithmetic operands; the only
> place where type transitions are not explicit in all cases is for
> MODIFY_EXPRs, which can have an implicit type-conversion from the type
> of the RHS to the type of the LHS (see
> tree_ssa_useless_type_conversion()).
>
> Richard.

Richard,

I actually do not know the details. From what I was told by Diego and Honza while we were in Russia, what you say is in fact the gold standard, and unfortunately gcc does not live up to that standard. The feeling was that Andrew was in fact fixing bugs, not changing the definition of gimple.

I am writing this because no one seems to think that Andrew and the reviewers are really finished with this process, and I was just giving a supportive nudge.

Kenny
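For readers who have not looked at it, the check Richard mentions boils down to asking whether a conversion can change a value. A toy model of that test (the struct is invented for illustration, not GCC's tree; see tree_ssa_useless_type_conversion() for the real thing):

    #include <stdbool.h>

    struct toy_type
    {
      unsigned precision;
      bool unsigned_p;
      bool pointer_p;
    };

    static bool
    useless_conversion_p (const struct toy_type *outer,
                          const struct toy_type *inner)
    {
      /* Conversions between pointer types do not change the value.  */
      if (outer->pointer_p && inner->pointer_p)
        return true;
      /* Integral conversions are useless when precision and signedness
         both match; otherwise truncation or re-extension may occur.  */
      if (!outer->pointer_p && !inner->pointer_p)
        return (outer->precision == inner->precision
                && outer->unsigned_p == inner->unsigned_p);
      return false;
    }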
Re: type consistency of gimple
Richard Guenther wrote:
> On 8/11/06, Kenneth Zadeck <[EMAIL PROTECTED]> wrote:
>> Richard Guenther wrote:
>>> On 8/11/06, Kenneth Zadeck <[EMAIL PROTECTED]> wrote:
>>>> Mark,
>>>>
>>>> I have had some discussions with Honza and Diego about the type
>>>> consistency at the gimple level. They told me that Andrew was in
>>>> the process of making gimple properly type consistent.
>>>>
>>>> I just wanted to point out how this affects encoding gimple into
>>>> dwarf. If the gimple is type consistent, then it looks like the
>>>> only place where I will need to write type references is at
>>>> CONVERT_EXPRs and NOP_EXPRs. If it is not type consistent, Diego
>>>> and Honza do not believe that I can properly get away without
>>>> putting type references at every tree node.
>>>>
>>>> This looks like about 20% of the size of writing a function body.
>>>> I do not know how close Pinskia's project is to completion, but
>>>> anything that you, or any one else, could do to help will pay off
>>>> for LTO. It has been suggested that I assume that gimple is type
>>>> consistent as a way to force the issue. I like this idea, but it
>>>> is not something that I feel should be kept a secret either.
>>>
>>> Maybe you can elaborate how you are missing information (assuming the
>>> SSA temporaries still have a type in their "decl" node)? Especially
>>> fold relies on exact matching types for arithmetic operands; the only
>>> place where type transitions are not explicit in all cases is for
>>> MODIFY_EXPRs, which can have an implicit type-conversion from the
>>> type of the RHS to the type of the LHS (see
>>> tree_ssa_useless_type_conversion()).
>>>
>>> Richard.
>> Richard,
>>
>> I actually do not know the details. From what I was told by Diego and
>> Honza while we were in Russia, what you say is in fact the gold
>> standard, and unfortunately gcc does not live up to that standard.
>> The feeling was that Andrew was in fact fixing bugs, not changing the
>> definition of gimple.
>>
>> I am writing this because no one seems to think that Andrew and the
>> reviewers are really finished with this process, and I was just giving
>> a supportive nudge.
>
> Ok, maybe it's about time we put the various patches that are floating
> around to verify this type correctness on the LTO branch? I know that
> at least at some point we could bootstrap with them.
>
> Richard.

It is quite possible that that may be the correct plan. We should wait until some of the gimple elite respond to this email. I know that Honza's patch has rotted because he tried it in Russia and it was DOA. I assume that Pinskia has a valid one since he has been beating out the varmints.
Re: type consistency of gimple
David Edelsohn wrote:
>> Diego Novillo writes:
>
> Diego> Agreed. This is a good opportunity for us to design a GIMPLE
> Diego> type system. Besides the obvious space savings and cleanliness,
> Diego> it is also needed to remove lang_hooks.types_compatible_p.
>
> And this last statement is the key point. We can and probably should
> implement a prototype LTO without cleaning up the type system, but for
> LTO to be usable in a production compiler, the GIMPLE type system will
> need to be fixed. Cleaning the type system is not an immediate
> requirement -- not a dependency for the initial implementation of LTO
> -- but LTO cannot be generally usable without this problem being
> addressed. A more compact representation is a byproduct, but
> multi-language correctness is necessary for full, production
> functionality.
>
> David

I am modifying my code so that there is a preprocessor flag, STUPID_TYPE_SYSTEM, that either writes or does not write the redundant type nodes. The comments by Diego imply that not only is this a space issue but that it will have to be solved in order to get rid of one or more of the lang_hooks. Thus, it does make some sense to work on this sooner rather than later.

I would suggest that we ask those with patches to strengthen the type system to contribute those patches to the lto branch, and for Diego (who I believe has the last working type checker) to contribute that type checker to the lto branch. And that we further invite those in the community who are willing to attack some of these problems to contribute those patches to the lto branch as they become available, on the basis that perhaps we can provide some fast path to getting them in the mainline if we are able to solve this problem in a timely manner.

When we get to the point where we can compile C with the type checker, we should turn it on in the lto branch and remove my STUPID_TYPE_SYSTEM code. This will keep the pressure up but at the same time allow us to get a prototype in a timely manner.

Kenny
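A sketch of how such a flag would gate the writer (only the flag name comes from the patch; the output_* helpers here are hypothetical):

    /* Writer side: emit a type reference on every operand only while
       the gimple type system cannot be trusted.  */
    static void
    output_operand (tree expr)
    {
      output_tree_code (TREE_CODE (expr));
    #if STUPID_TYPE_SYSTEM
      /* Redundant once gimple is type consistent.  */
      output_type_ref (TREE_TYPE (expr));
    #endif
      /* ... recurse into the operands ... */
    }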
Re: type consistency of gimple
Nick Clifton wrote:
> Hi Diego,
>
>> Jeff's point about our optimizers is also true. Nick, remember that
>> issue with MIPS optimizations you were discussing with Jeff a few days
>> ago? I didn't follow most of the details, but it involved ivopts and
>> sign issues. Could you send a summary?
>
> Sure:
>
> I was looking at how a gcc 4.1 based compiler optimized a fast fourier
> transform algorithm for the MIPS target. What I found was that it was
> producing much worse code than a similarly configured gcc 3.4 based
> compiler, and the culprit was the creation of the induction variables
> used in the inner loop.

I think that you are pointing the gun at the wrong suspect. I believe that the true villain is somewhere downstream in the backend passes that cannot see through the type conversions. This is one case of us having "degraded" the back end because the middle end likes to break things up into smaller pieces and the back end has too small a window to look through to do its work. We should be looking at the back end to see where it cannot see what it needs to see, rather than trying to stop the middle end from getting the code into a reasonable form.

> The nested loops look like this:
>
> signed short i,j,k,DataSizeExponent,DataSize;
> signed short RealBitRevData[MAX], ImagBitRevData[MAX];
> signed short * CosineV, * SineV;
>
> for (k = 1; k <= DataSizeExponent; k++)
>   {
>     signed short n1 = 1 << k;
>     signed short n2 = n1 >> 1;
>     signed short ArgIndex = 0;
>     signed short DeltaIndex = (DataSize >> 1) / n2;
>
>     for (j = 0; j < n2; j++)
>       {
>         signed int WReal = CosineV[ArgIndex];
>         signed int WImag = SineV[ArgIndex];
>
>         ArgIndex += DeltaIndex;
>         for (i = j; i < DataSize; i += n1)
>           {
>             signed short l = i + n2;
>             signed int tRealData = (WReal * RealBitRevData[l]) +
>                                    (WImag * ImagBitRevData[l]);
>             signed int tImagData = (WReal * ImagBitRevData[l]) -
>                                    (WImag * RealBitRevData[l]);
>
>             tRealData = tRealData >> BUTTERFLY_SCALE_FACTOR;
>             tImagData = tImagData >> BUTTERFLY_SCALE_FACTOR;
>             RealBitRevData[l] = RealBitRevData[i] - tRealData;
>             ImagBitRevData[l] = ImagBitRevData[i] - tImagData;
>             RealBitRevData[i] += tRealData;
>             ImagBitRevData[i] += tImagData;
>           }
>       }
>   }
>
> and the inner loop was being transformed into:
>
> short int D.2046, D.2043;
> long int D.2038, D.2035;
> int D.2033, D.2024, D.2020, D.2008, D.2001;
> [...]
>
> :;
> D.2033 = (int) (signed short) ivtmp.67;
> D.2035 = (long int) RealBitRevData[D.2033];
> D.2038 = (long int) ImagBitRevData[D.2033];
> tRealData = WReal * D.2035 + WImag * D.2038;
> tImagData = WReal * D.2038 - WImag * D.2035;
> D.2001 = (int) i;
> D.2043 = (short int) (tRealData >> 15);
> RealBitRevData[D.2033] = RealBitRevData[D.2001] - D.2043;
> D.2046 = (short int) (tImagData >> 15);
> ImagBitRevData[D.2033] = ImagBitRevData[D.2001] - D.2046;
> RealBitRevData[D.2001] = D.2043 + RealBitRevData[D.2001];
> ImagBitRevData[D.2001] = D.2046 + ImagBitRevData[D.2001];
> i = (signed short) ((short unsigned int) i + ivtmp.78);
> ivtmp.68 = ivtmp.68 + ivtmp.78;
> ivtmp.67 = ivtmp.67 + ivtmp.78;
> if (DataSize > (signed short) (ivtmp.68 - ivtmp.78)) goto ; else goto ;
>
> The net result of these induction variables, and especially the type
> translations necessary to go between signed and unsigned and shorts and
> ints, was an inner loop consisting of 42 instructions, as opposed to an
> inner loop of 32 instructions as produced by the gcc 3.4 compiler.
>
> Cheers
> Nick
Re: type consistency of gimple
Diego Novillo wrote:
> Kenneth Zadeck wrote on 08/15/06 11:57:
>> We should be looking at the back end to see where it cannot see what
>> it needs to see, rather than trying to stop the middle end from
>> getting the code into a reasonable form.
>
> You're confused. This is a middle-end mis-optimization. However, it
> is true that we should probably make the optimizers smarter.
>
> First, though, we must define a GIMPLE type system.

I most likely am. You are right, we need a type system.

kenny
Re: First cut on outputing gimple for LTO using DWARF3. Discussion invited!!!!
Chris Lattner wrote:
> On Aug 28, 2006, at 6:57 AM, Kenneth Zadeck wrote:
>> This posting is a progress report of my task of encoding and decoding
>> the GIMPLE stream into LTO. Included in this posting is a patch that
>> encodes functions and dumps the result to files.
>
> Interesting email. One question: how big are the files you're
> generating? For example, how much space does it take to store the body
> of a function with 100 add operations in it (before gzip'ing it)? What
> is the size proportional to? # instructions, # basic blocks, what
> else?
>
> -Chris

The size will be large because the trees are still large. Encoding everything in a bootstrap for the three languages was on the order of 60MB. As you know, size is a well known problem in gcc. The plan is to track the changes in gimple, and as that gets smaller, this will get smaller. I would like to get the types and perhaps the flags out of the internal nodes of the trees. When that gets done, these operations should be small.

I actually do not think that it is productive to spend that much time measuring this until we first assure ourselves that we are getting all of the information output correctly. That 60MB number will change a lot (both up and down) as we get closer to getting everything tied together. Then we can spend more time looking at the pieces and seeing where we need to do better.
Re: First cut on outputing gimple for LTO using DWARF3. Discussion invited!!!!
Michael Eager wrote:
> Kenneth Zadeck wrote:
>> This posting is a progress report of my task of encoding and decoding
>> the GIMPLE stream into LTO. Included in this posting is a patch that
>> encodes functions and dumps the result to files.
>
> I have only a limited understanding of GIMPLE and LTO, but here are my
> comments regarding DWARF.
>
> DWARF is a debugging format used to describe a program's source and how
> it is represented in an executable file. GIMPLE is an intermediate
> representation, not program source. It's always interesting to see one
> tool used in a different context, but I'm not clear that in this case
> you are using the right tool for the right task in the way that it was
> designed.
>
> Speaking on behalf of the DWARF Standards Workgroup, we welcome
> suggestions for extensions or improvements in the DWARF standard. I
> wasn't able to identify any specific extension in your email (although
> I didn't look at the patch in detail). If there are specific areas
> where DWARF can be improved, please let me know.

Michael,

My original posting was primarily intended for the gcc developer community, which is familiar with gimple and not so familiar with dwarf. I actually think that dwarf is probably pretty good for what it was designed for, and it is also documented much better than most of the internals of gcc. The task I was given was to encode gimple into dwarf, something that looked like it might be a reasonable idea, but is, in fact, a bad idea. My posting was to tell the gcc community: been there, done that, now let's move on.

I will respond to many of your points, but you should not take this as criticism of dwarf; these are mostly just places where the interface between gimple and dwarf was particularly troublesome, and unless you want to get into the business of transporting intermediate code, which I strongly suggest you avoid, you can just ignore most of these problems.

>> 1) ABBREV TABLES ARE BAD FOR LTO.
>>
>> The abbrev table is a mechanism for making a DWARF3 file self
>> descriptive. It contains a set of templates that are used to describe
>> the records and attributes that are written to describe the object
>> module.
>>
>> This self description mechanism is much weaker (and therefore much
>> more compact) than an XML file's schemata because it depends on a
>> common understanding of the meaning of the tags and attributes that
>> describe the pieces of the data. The tags themselves can be thought
>> of as characters, and an abbrev entry is a magic cookie that tells you
>> how to break the characters into words.
>
> DWARF is not self-descriptive, except in the very limited sense that
> several sections contain version numbers. The DWARF abbreviation table
> is not intended to be self-descriptive or to be descriptive of the
> DWARF data. It is a simple compression scheme which compacts the
> Tag/Attribute/Value structure of a DWARF DIE into a much more condensed
> format.
>
> Certainly the DWARF Abbrev table makes no pretense of being anything
> similar to an XML schema. I wouldn't use an XML schema for debugging
> information (it would be huge and slow), and I wouldn't expect to use
> DWARF abbreviation tables to describe anything at all.

While I understand your point, from a language theoretic point of view, they really can be thought of as schemata. They are a description of the record types that have been written by the compiler (or some other tool) for the tags and attributes in this module. This may not be the way that they evolved in your community, but for someone coming from the outside with no knowledge of your history, this is what they are.

>> However, this mechanism is only self descriptive if you do not extend
>> the set of tags. That is not an option for LTO. While the DWARF3
>> designers seem to have added every conceivable tag to describe object
>> code, it is not surprising that this comes up short when describing
>> GIMPLE. While some of GIMPLE could be shoe-horned into the existing
>> tags, all of it will not fit, so anyone who wants to understand
>> GIMPLE will need to use the GCC headers.
>
> The DWARF standard provides for a consistent means to extend the
> debugging format. If you extend DWARF beyond the standard, as has
> often been done, you will need to provide descriptions (in some form)
> of what these extensions mean. You can provide this as written
> material, or you can have the implementer read the code. In either
> case, the Abbrev tables only provide compression for the Tag/Attr/Val
> structure of a DIE. They do not provide any semantic description.

What I did was add 4 new DW_FORMS. This is one of the areas t
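Since the abbrev/DIE split keeps coming up in this thread, here is a hand-written example of the mechanism (standard DWARF, nothing GCC-specific; the DW_AT_type offset is an illustrative made-up value):

    /* One .debug_abbrev entry: a template for one record shape.  */
    static const unsigned char abbrevs[] = {
      0x01,                    /* abbrev code 1 */
      0x34,                    /* DW_TAG_variable */
      0x00,                    /* DW_CHILDREN_no */
      0x03, 0x08,              /* DW_AT_name, DW_FORM_string */
      0x49, 0x13,              /* DW_AT_type, DW_FORM_ref4 */
      0x00, 0x00,              /* end of this abbrev */
      0x00                     /* end of the abbrev table */
    };

    /* The matching .debug_info DIE stores only the abbrev code plus
       the attribute values, in the order the abbrev declared them.  */
    static const unsigned char die[] = {
      0x01,                    /* "decode me using abbrev 1" */
      'x', 0,                  /* DW_AT_name */
      0x2a, 0x00, 0x00, 0x00   /* DW_AT_type: CU-relative offset */
    };

In that sense each abbrev really is a record template - a tiny schema for one record shape - even though it says nothing about what the record means.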
Re: First cut on outputing gimple for LTO using DWARF3. Discussion invited!!!!
Mark Mitchell wrote:
> Kenneth Zadeck wrote:
>> This will be more cumbersome if we have to keep reloading each object
>> file's abbrev table just to tear apart a single function in that .o
>> file. While the abbrev sections average slightly less than 2% of the
>> size of the GIMPLE encoding for an entire file, each abbrev table
>> averages about the same size as a single function.
>
> Interesting datapoint.
>
> (Implied, but not stated, in your mail is the fact that the
> abbreviation table cannot be indexed directly. If it could be, then
> you wouldn't have to read the entire abbreviation table for each
> function; you would just read the referenced abbreviations. Because
> the abbreviation table records are of variable length, it is indeed
> true that you cannot make random accesses to the table. So, this
> paragraph is just fleshing out your argument.)
>
> I think the conclusion that you reach (that the size of the tables is a
> problem) depends on how you expect the compiler to process functions at
> link-time. My expectation was that you would form a global
> control-flow graph for the entire program (by reading CFG data encoded
> in each .o file), eliminate unreachable functions, and then
> inline/optimize functions one-at-a-time.
>
> If you sort the function-reading so that you prefer to read functions
> from the same object file in order, then I would expect that you would
> considerably reduce the impact of reading the abbreviation tables. I'm
> making the assumption that if f calls N functions, then they probably
> come from < N object files. I have no data to back up that assumption.
>
> (There is nothing that says that you can only have one abbreviation
> table for all functions. You can equally well have one abbreviation
> table per function. In that mode, you trade space (more abbreviation
> tables, and the same abbreviation appearing in multiple tables) against
> the fact that you now only need to read the abbreviation tables you
> need. I'm not claiming this is a good idea.)
>
> I don't find this particular argument (that the abbreviation tables
> will double file I/O) very convincing. I don't think it's likely that
> the problem we're going to have with LTO is running out of *virtual*
> memory, especially as 64-bit hardware becomes nearly universal. The
> problem is going to be running out of physical memory, and thereby
> paging like crazy, running out of D-cache. So, I'd assume you'd just
> read the tables as-needed, and never bother discarding them. As long
> as there is reasonable locality of reference to abbreviation tables
> (i.e., you can arrange to hit object files in groups), then the cost
> here doesn't seem like it would be very big.

Even if we decide that we are going to process all of the functions in one file at one time, we still have to have access to the functions that are going to be inlined into the function being compiled. Getting at those functions that are going to be inlined is where the doubling of the I/O argument comes from.

I have never depended on the kindness of strangers or the virtues of virtual memory. I fear the size of the virtual memory when we go to compile really large programs.

>> 2) I PROMISED TO USE THE DWARF3 STACK MACHINE AND I DID NOT.
>
> I never imagined you doing this; as per above, I always expected that
> you would use DWARF tags for the expression nodes. I agree that the
> stack-machine is ill-suited.
>
>> 3) THERE IS NO COMPRESSION IN DWARF3.
>>
>> In 1 file per mode, zlib -9 compression is almost 6:1. In 1 function
>> per mode, zlib -9 compression averages about 3:1.
>
> In my opinion, if you considered DWARF + zlib to be satisfactory, then
> I think that would be fine. For LTO, we're allowed to do whatever we
> want. I feel the same about your confession that you invented a new
> record form; if DWARF + extensions is a suitable format, that's fine.
> In other words, in principle, using a somewhat non-standard variant of
> DWARF for LTO doesn't seem evil to me -- if that met our needs.

One of the comments that was made by a person on the dwarf committee is that the abbrev tables really can be used for compression. If you have information that is really common to a bunch of records, you can build an abbrev entry with the common info in it. I have not seen a place where any use can be made of this for encoding gimple, except for a couple of places where I have encoded a true or false. I therefore really do not see that they really add anything except for the code to read and write them.

>> 2) LOCAL DECLARATIONS
>> Mark was going to do all of the types and all of the declar
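The zlib numbers quoted above are easy to reproduce; the compression call itself is one line. A minimal sketch (the buffer contents stand in for a serialized function body):

    #include <stdio.h>
    #include <zlib.h>

    int
    main (void)
    {
      const Bytef body[] = "...bytes of one serialized function body...";
      uLong src_len = sizeof (body);
      uLongf dst_len = compressBound (src_len);
      Bytef dst[256];        /* >= compressBound (src_len) for this input */

      /* Level 9 corresponds to the "zlib -9" figures above.  */
      if (compress2 (dst, &dst_len, body, src_len, 9) != Z_OK)
        return 1;
      printf ("%lu -> %lu bytes\n", (unsigned long) src_len,
              (unsigned long) dst_len);
      return 0;
    }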
Re: First cut on outputing gimple for LTO using DWARF3. Discussion invited!!!!
Mark Mitchell wrote:
> Kenneth Zadeck wrote:
>> Even if we decide that we are going to process all of the functions in
>> one file at one time, we still have to have access to the functions
>> that are going to be inlined into the function being compiled.
>> Getting at those functions that are going to be inlined is where the
>> doubling of the I/O argument comes from.
>
> I understand -- but it's natural to expect that those functions will be
> clumped together. In a gigantic program, I expect there are going to
> be clumps of tightly connected object files, with relatively few
> connections between the clumps. So, you're likely to get good cache
> behavior for any per-object-file specific data that you need to access.

I just do not know. I assume that you are right, that there is some clumping. But I am just not sure.

>> I have never depended on the kindness of strangers or the virtues of
>> virtual memory. I fear the size of the virtual memory when we go to
>> compile really large programs.
>
> I don't think we're going to blow out a 64-bit address space any time
> soon. Disks are big, but they are nowhere near *that* big, so it's
> going to be pretty hard for anyone to hand us that many .o files. And,
> there's no point manually reading/writing stuff (as opposed to mapping
> it into memory), unless we actually run out of address space.

I am not so concerned with running out of virtual address space as I am about being able to break this up so that it can be done in parallel, on a farm of machines. Otherwise, lto can never be part of anyone's compile/test loop. The notion of having 20 or 50 compile servers, each mapping in all of the files of a large system, seems like a bad design point.

> In fact, if you're going to design your own encoding formats, I would
> consider a format with self-relative pointers (or, offsets from some
> fixed base) that you could just map into memory. It wouldn't be as
> compact as using compression, so the total number of bytes written when
> generating the object files would be bigger. But, it will be very
> quick to load it into memory.

If you look at my code, that is what I have, at least with respect to the function itself.

There is one big difference here between lto and what a debugger needs. I could see designing a debugger (and I have no idea if any such debuggers exist) that simply maps in the debug information and just uses the in-core representation as is. Dwarf seems to have been designed to support this (but then again I could be dreaming). With an intermediate form of a compiler, the usage is quite different. All that we are going to do is load a function, convert it to gimple and then throw away (the notion of throw away may not have meaning for memory mapped files) the on-disk version. The prime goal is that the format be designed so that an entity (generally a function) can be expanded into gimple in one pass. Then the question of the benefit of using a compressor comes down to processor speed vs I/O speed.

With the parts that you are in charge of, namely the types and the globals, this is not true. I can very easily see an implementation of the types and decls that is like I describe for the debugger: you map it into memory and just use it from there. But since the intermediate code for a function body is going to get very modified, and our existing gimple is chock full of pointers, it is harder to envision ever winning at the mapping game.

> I guess my overriding concern is that we're focusing heavily on the
> data format here (DWARF? Something else? Memory-mappable? What
> compression scheme?) and we may not have enough data. I guess we just
> have to pick something and run with it. I think we should try to keep
> that code as separate as possible so that we can recover easily if
> whatever we pick turns out to be (another) bad choice. :-)
>
>> One of the comments that was made by a person on the dwarf committee
>> is that the abbrev tables really can be used for compression. If you
>> have information that is really common to a bunch of records, you can
>> build an abbrev entry with the common info in it.
>
> Yes. I was a little bit surprised that you don't seem to have seen
> much commonality. If you recorded most of the tree flags, and treated
> them as DWARF attributes, I'd expect you would see relatively many
> expressions of a fixed form. Like, there must be a lot of PLUS_EXPRs
> with TREE_USED set on them. But, I gather that you're trying to avoid
> recording some of these flags, hoping either that (a) they won't be
> needed, or (b) you can recreate them when reading the file. I think
> both (a) and (b) hold in many cases, so I think it'
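The self-relative-pointer layout Mark suggests can be shown in a few lines of C (a toy; the field names are invented). Each reference stores an offset from its own address, so the blob works at whatever address it happens to be mapped:

    #include <stdint.h>

    typedef int32_t rel_ptr;   /* offset from the field's own address */

    static inline void *
    resolve (const rel_ptr *field)
    {
      return (char *) field + *field;
    }

    struct node_on_disk
    {
      uint32_t code;       /* e.g. a tree code */
      rel_ptr  operand0;   /* reference to another node_on_disk */
      rel_ptr  operand1;
    };

The trade-off noted in the thread is real: offsets do not compress as well as a stream designed for compression, but no relocation or pointer-fixup pass is needed at load time.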
Re: First cut on outputing gimple for LTO using DWARF3. Discussion invited!!!!
Mark Mitchell wrote:
> Kenneth Zadeck wrote:
>> I am not so concerned with running out of virtual address space as I
>> am about being able to break this up so that it can be done in
>> parallel, on a farm of machines. Otherwise, lto can never be part of
>> anyone's compile/test loop.
>
> I think we just expanded the scope of work by an order of magnitude.
> :-)
>
> If you had just said that you wanted to support multi-threaded LTO,
> that would have been a big deal. But multiple machines with multiple
> address spaces trying to do LTO on one program is a really big deal.
> (Of course, there is a "cheap hack" way of doing what you want: run LTO
> on clumps of object files in parallel, and then just link the
> pre-optimized files together in the ordinary way.)
>
> I'd really like to see us inline a function before we even begin to
> have this conversation. :-)

I have no intention of expanding the scope of the work. I just am not going to do anything that precludes doing this. As long as the function bodies are in a form that can be easily accessed without reading and processing the entire file, I am fine.

>> I have no idea how "stable" all the types and decls are over a
>> compilation. I write my info pretty early, and I assume the types and
>> decls are written pretty late in the compilation (otherwise you would
>> not have address expressions for the debugger). If there has been any
>> "processing" on these between when I write my stuff and when the types
>> and decls get written, things may not match up.
>
> I don't think that this is an issue. The important information about
> types and declarations is stable. Things like "is this declaration
> used?" change over the course of the compilation, but that's not useful
> for DWARF anyhow -- and, in general, we don't write out information
> about types/declarations that are entirely unused. The key aspects
> (sizes/layouts/etc.) are fixed.

You are either right, or we will fix it. Those are the only two options. My only point was that there is a lot of compiler in between where someone could have done something silly.
Re: First cut on outputing gimple for LTO using DWARF3. Discussion invited!!!!
Daniel Jacobowitz wrote:
> On Thu, Aug 31, 2006 at 04:00:25PM -0700, Mark Mitchell wrote:
>> I think this is probably moot, since I believe that Kenny feels DWARF
>> is not suitable for reasons other than the abbreviation table issue,
>> but this is a clever technique.
>
> ... for GIMPLE; when I discussed it with him, I got the impression he
> was still open to using it for other things, like types. I may have
> been mistaken.

Given that Mark, and for that matter no one else, did not really push back, I am pretty much committed not to use dwarf.

Kenny
Re: First cut on outputing gimple for LTO using DWARF3. Discussion invited!!!!
Daniel Jacobowitz wrote:
> On Fri, Sep 01, 2006 at 09:45:34AM -0400, Kenneth Zadeck wrote:
>> Given that Mark, and for that matter no one else, did not really push
>> back, I am pretty much committed not to use dwarf.
>
> Then... what are you going to do about things like types? Invent a new
> serialization for those too? I think that confusing dwarf-for-types
> and dwarf-for-gimple would be a mistake.

My part is only the function bodies; we are still going to use dwarf for the types and the global variables. There are people at CodeSourcery who, even as we speak, are busily enhancing that part to get all of the pieces output, not just the parts used for the debugger.

Kenny
Re: First cut on outputing gimple for LTO using DWARF3. Discussion invited!!!!
Diego Novillo wrote:
> Kenneth Zadeck wrote on 08/28/06 09:57:
>> I have not done this because I do not rule the earth. That was not
>> what I was assigned to do, and I agreed that DWARF3 sounded like a
>> reasonable way to go. Now that I understand the details of DWARF3, I
>> have changed my mind about the correct direction. Now is the time to
>> make that change before there is a lot of infrastructure built that
>> assumes the DWARF3 encoding.
>
> FWIW. I agree with your conclusion. I do not know DWARF3 very well,
> but it always felt a bit heavy handed to me.
>
> At first, I was under the impression that we were going to use DWARF3
> to describe the types and symbol table, and embed our bytecode
> alongside. It sounds to me like we want to either invent our own
> bytecode or adapt an existing one to our needs. Inventing our own is
> probably the best long-term alternative.
>
> A few comments on the code:
>
>> Index: lto-tree-flags.def
>> [ ... ]
>> +/* This file is designed to be inlined into both the writing and the
>> +   reading of the lto information. What it does depends on the glue
>> +   that is put in front and at the end of this and how ADD_BASE_FLAG
>> +   is defined.
>> +
>> +   For those interested in extra credit, this could also be used as
>> +   the checking code to see if each flag is used correctly. 10 extra
>> +   credit points will be given to the intrepid programmer who does
>> +   this and is able to remove the comment that this was generated
>> +   from in tree.h. */
>
> Nothing in this comment helps in understanding what the file is
> supposed to do. What is this used for?

When I get the other side of the code finished and enhance the comments, you will see.

>> +  switch (code)
>> +    {
>> +    case ABS_EXPR:
>> +      break;
>> +
>> +    case ADDR_EXPR:
>> +      break;
>> +
>> +      ADD_BASE_FLAG (static_flag)
>
> This code is unreachable. There are many instances of this here.

I would have found that when I wrote the code that reads this back.

>> +/* Return the master clone node of N if it is available and if
>> +   CHECK_OVERWRITE is true, not overwritable. */
>
> What does it mean to be overwritable? You never seem to call this
> function with check_overwrite == false.

I have no idea what overwritable is used for. This is something that Honza and I went around on many times when I was writing my ipa code. It appears that you can put attributes on some functions that allow the function to be replaced late, like at link or even run time. For instance, when I was doing my pure-const analysis, I had to mark these as not being pure or const even if they looked like they could be, since they could be replaced by versions that were not pure or const. However, here, I need the master clone node, but I do not care if it is overwritable; I have to put it out anyway. I do call it that way now. Thanks for noticing.

>> +struct char_ptr_base
>> +{
>> +  char *ptr;
>> +};
>
> Hmm? Why? Planning to have something different in the future?
>
>> +/* An incore byte stream to buffer the various parts of the
>> +   function. The entire structure should be zeroed when created.
>> +   The record consists of a set of blocks. The first sizeof (ptr)
>> +   bytes are used as a chain, and the rest store the bytes to be
>> +   written. */
>> +
>> +struct output_stream
>> +{
>> +  /* The pointer to the first block in the stream. */
>> +  struct char_ptr_base * first_block;
>> +  /* The pointer to the last and current block in the stream. */
>> +  struct char_ptr_base * current_block;
>> +  /* The pointer to where the next char should be written. */
>> +  char * current_pointer;
>> +  /* The number of characters left in the current block. */
>> +  unsigned int left_in_block;
>> +  /* The block size of the last block allocated. */
>> +  unsigned int block_size;
>> +  /* The total number of characters written. */
>> +  unsigned int total_size;
>> +};
>
> Maybe there's code out there for paging/caching data streams? Though
> we would probably want to tweak it quite a bit, it may save some effort
> for the base functionality.
>
> Even if we end up not using DWARF3 as output, the streaming aspect of
> this code will remain, I guess?

Yes, the streaming remains. I need to do it this way so that I can then pass some of the buffers into a compressor. I had though
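From the fields quoted above, appending a byte to the stream presumably looks something like this (my reconstruction from the struct comments; the grow helper is hypothetical):

    static void output_stream_grow (struct output_stream *);
    /* hypothetical: chain a new block onto current_block and reset
       current_pointer / left_in_block */

    static void
    output_byte (struct output_stream *obs, char c)
    {
      if (obs->left_in_block == 0)
        output_stream_grow (obs);
      *obs->current_pointer++ = c;   /* store and advance */
      obs->left_in_block--;
      obs->total_size++;
    }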
Re: First cut on outputing gimple for LTO using DWARF3. Discussion invited!!!!
Andrew Pinski wrote:
> On Fri, 2006-09-01 at 09:51 -0400, Diego Novillo wrote:
>>> +#if STUPID_TYPE_SYSTEM
>>
>> STUPID_TYPE_SYSTEM? No need to be insulting. It's unpleasant.
>
> Well, right now it is stupid; this is just a workaround anyway until
> people fix the type mismatches for real. Is it more insulting than
> having stupid.c, which existed in GCC before 3.0.0? Also this is very
> temporary and will go away when the problem in the rest of the compiler
> is fixed.
>
> -- Pinski

I will tone it down; the point has been made.

kenny
this code in fold-const.c:fold_single_bit_test looks wrong to me
  if (TREE_CODE (inner) == RSHIFT_EXPR
      && TREE_CODE (TREE_OPERAND (inner, 1)) == INTEGER_CST
      && TREE_INT_CST_HIGH (TREE_OPERAND (inner, 1)) == 0
      && bitnum < TYPE_PRECISION (type)
      && 0 > compare_tree_int (TREE_OPERAND (inner, 1),
                               bitnum - TYPE_PRECISION (type)))
    {
      bitnum += TREE_INT_CST_LOW (TREE_OPERAND (inner, 1));
      inner = TREE_OPERAND (inner, 0);
    }

In particular, in the last stanza of the test, TREE_OPERAND (inner, 1) is a non-negative number by the second and third stanzas. bitnum is also always non-negative and, by the fourth stanza, less than TYPE_PRECISION (type), so bitnum - TYPE_PRECISION (type) is always negative. The compare will therefore always be positive, so this code will never be executed. It is hard to believe that this is what you want.

kenny
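To make the dead-code argument concrete with numbers: suppose TYPE_PRECISION (type) is 32. The fourth stanza forces 0 <= bitnum < 32, so bitnum - TYPE_PRECISION (type) lies in [-32, -1], while the shift count is a non-negative constant; compare_tree_int then always returns 1, and "0 > 1" is always false. Presumably the intent was something like

    0 > compare_tree_int (TREE_OPERAND (inner, 1),
                          TYPE_PRECISION (type) - bitnum)

i.e., only fold when the shift amount keeps the tested bit inside the precision - but that is a guess at the intended condition, not a verified fix.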
Re: [wide-int] int_traits
I am a little confused here. What is the reason for doing the is_sign_extended thing? Is the point of deferring the canonicalization that we can avoid the use of the scratch array until the value is actually used, and then we canonicalize on demand? That seems like it is going to require several bits of state: like one to say if we have canonicalized or not. All of these will then have to be checked on every operation. However, if it keeps the alias analyzer sane then I guess it is ok, but it makes everything else a little more complex.

Kenny

On 10/18/2013 10:57 AM, Richard Sandiford wrote:

[off-list]

Kenneth Zadeck writes:
Richi, do you want me to back out the patch that changes the rep for unsigned tree-csts? kenny

Doesn't look like you're on IRC, so FWIW:

richi: just to check, you still want the scratch array to be separate from the other fields, to avoid the other fields escaping, is that right? [11:13] will try that today if so
rsandifo: that would be nice, but I'm not settled yet on what to do on ::decompose fully - the idea of a ::is_sign_extended storage-ref flag popped up [11:17]
rsandifo: so eventually we don't need a scratch member anymore (which would be best)
rsandifo: so if you can give that idea a take instead ... ;) [11:18] today I'll try to clear my patch review pipeline ...
yeah. Just to be sure: we still need the extra zero HWI for the large unsigned constants though, right?
no, in that case we wouldn't need that
we'd have extra (conditional) sext() operations in all sign-dependent ops though [11:19]
thus delay the canonicalization until the first use
I thought the ::is_sign_extended was dealing with a different case (small_prec). The problem with the extra zero HWI is different.
no, it would deal with the case in general and say "if you need a sign-extended rep you have to sign-extend" [11:20]
but the point of the extra HWI is that if you extend a 64-bit tree to 128 bits, you have two significant HWIs.
for the fixed_wide_int rep we then want to combine the extension with the copy in its constructor [11:21]
For unsigned constants that extra HWI is zero. For signed constants it's minus one. The choice matters in that case, since it affects the 128-bit result of any arithmetic
but here the interface is just what limits us - the fact that decompose allows precision != xprecision [11:22]
that case should be split out to a different "decompose"
it's just needed for fixed_wide_int AFAIK?
But the interface was defined that way to avoid constructing fixed_wide_ints for x and y when doing x + y -> fixed_wide_int
we could instead do addition with three different precisions (x, y and the result), but that'd get complicated... [11:23]
hmm, I see
well, then leave the extra zeros alone for now [11:24]
ok, works for me, thanks
so for this ::is_sign_extended thing, we basically go back to the original "arbitrary upper bits" as the default (conservative) assumption, but add optimisations for the case where ::is_sign_extended is a compile-time true? [11:25]
yes
OK
"arbitrary upper bits" would be either sign- or zero-extended though (but it probably doesn't make a difference to arbitrary)
on a tree wide_int_ref would have that "arbitrary upper bits"
that in as far as I can see should get rid of scratch (fingers crossing) [11:27]
s/on/only/
Right -- that copy wasn't there in Kenny's original version, it only came in because of the "small_prec having defined upper bits". I agree ::is_sign_extended sounds like a nice compromise [11:28]
let's hope it will work out and improve code as desired ;)
also the parts Mike suggested, adding more compile-time known optimizations [11:29]
yeah
and making tree.c predicates use the tree rep directly
though then integer_zerop will be magically more efficient than t == 0 ... [11:30]
would be nice if we could wi:: to be efficient enough so that we can keep the tree.c bits as-is [11:32]
...could get... [11:33]
with CONSTANT (...) tricks or whatever
realise that might be a bit optimistic though...
yeah [12:41]
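A concrete instance of the extension question in the log above (HWI arrays written least significant element first): widening the 64-bit constant 0x8000000000000000 to 128 bits gives

    unsigned interpretation: { 0x8000000000000000, 0x0000000000000000 }
    signed interpretation:   { 0x8000000000000000, 0xffffffffffffffff }

Any 128-bit arithmetic then sees different operands depending on which extension was baked in, which is why the representation - or an is_sign_extended flag plus an on-demand sext() - has to commit to one story.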
suspect code in fold-const.c
In doing the wide-int conversion, I have found the following code on the trunk which seems somewhat suspect. This code from fold-const.c starts on line 13811:

      else if (TREE_INT_CST_HIGH (arg1) == signed_max_hi
               && TREE_INT_CST_LOW (arg1) == signed_max_lo
               && TYPE_UNSIGNED (arg1_type)
               /* We will flip the signedness of the comparison
                  operator associated with the mode of arg1, so the
                  sign bit is specified by this mode.  Check that arg1
                  is the signed max associated with this sign bit.  */
               && width == GET_MODE_BITSIZE (TYPE_MODE (arg1_type))
               /* signed_type does not work on pointer types.  */
               && INTEGRAL_TYPE_P (arg1_type))

It seems that the check on bitsize should really be a check on the precision of the variable. If this seems right, I will correct this on the trunk and make the appropriate changes to the wide-int branch.

kenny
Re: suspect code in fold-const.c
On 11/15/2013 04:07 AM, Eric Botcazou wrote:
>> this code from fold-const.c starts on line 13811.
>>
>>       else if (TREE_INT_CST_HIGH (arg1) == signed_max_hi
>>                && TREE_INT_CST_LOW (arg1) == signed_max_lo
>>                && TYPE_UNSIGNED (arg1_type)
>>                /* We will flip the signedness of the comparison
>>                   operator associated with the mode of arg1, so the
>>                   sign bit is specified by this mode.  Check that
>>                   arg1 is the signed max associated with this sign
>>                   bit.  */
>>                && width == GET_MODE_BITSIZE (TYPE_MODE (arg1_type))
>>                /* signed_type does not work on pointer types.  */
>>                && INTEGRAL_TYPE_P (arg1_type))
>>
>> with width defined as:
>>
>>       unsigned int width = TYPE_PRECISION (arg1_type);
>>
>> it seems that the check on bitsize should really be a check on the
>> precision of the variable. If this seems right, I will correct this
>> on the trunk and make the appropriate changes to the wide-int branch.
>
> Do you mean
>
>   && width == GET_MODE_PRECISION (TYPE_MODE (arg1_type))
>
> instead? If so, that would probably make sense, but there are a few
> other places with the same TYPE_PRECISION/GET_MODE_BITSIZE check, in
> particular the very similar transformation done in
> fold_single_bit_test_into_sign_test.

Yes. I understand the need to do this check on the mode rather than the precision of the type itself. The point is that if the mode under this type happens to be a partial int mode, then that sign bit may not even be where the bitsize points to.

However, having just done a few greps, it looks like this case was just the one that I found while doing the wide-int work; there may be several more of these cases. Just in fold-const, there are a couple in fold_binary_loc. The one in tree.c:int_fits_type_p looks particularly wrong. I think that there are also several in tree-vect-patterns.c.

Kenny
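For a concrete case of the bitsize/precision split: on a port with a partial-integer mode holding, say, 24 significant bits in a 32-bit word (a made-up but representative PSImode), GET_MODE_BITSIZE is 32 while GET_MODE_PRECISION is 24, and the sign bit is bit 23, not bit 31. A "width == GET_MODE_BITSIZE" guard can then never match a 24-bit type even when the transformation is valid, and code such as int_fits_type_p's "GET_MODE_BITSIZE (...) - 1" points at a bit that is not the sign bit at all.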
Re: suspect code in fold-const.c
This patch fixes a number of places where the mode bitsize had been used but the mode precision should have been used. The tree level is somewhat sloppy about this - some places use the mode precision and some use the mode bitsize. It seems that the mode precision is the proper choice, since it does the correct thing if the underlying mode is a partial int mode.

This code has been tested on x86-64 with no regressions. Ok to commit?

2013-11-15  Kenneth Zadeck

	* tree.c (int_fits_type_p): Change GET_MODE_BITSIZE to
	GET_MODE_PRECISION.
	* fold-const.c (fold_single_bit_test_into_sign_test,
	fold_binary_loc): Change GET_MODE_BITSIZE to
	GET_MODE_PRECISION.

Kenny

On 11/15/2013 08:32 AM, Kenneth Zadeck wrote:
> On 11/15/2013 04:07 AM, Eric Botcazou wrote:
>>> this code from fold-const.c starts on line 13811.
>>>
>>>       else if (TREE_INT_CST_HIGH (arg1) == signed_max_hi
>>>                && TREE_INT_CST_LOW (arg1) == signed_max_lo
>>>                && TYPE_UNSIGNED (arg1_type)
>>>                /* We will flip the signedness of the comparison
>>>                   operator associated with the mode of arg1, so the
>>>                   sign bit is specified by this mode.  Check that
>>>                   arg1 is the signed max associated with this sign
>>>                   bit.  */
>>>                && width == GET_MODE_BITSIZE (TYPE_MODE (arg1_type))
>>>                /* signed_type does not work on pointer types.  */
>>>                && INTEGRAL_TYPE_P (arg1_type))
>>>
>>> with width defined as:
>>>
>>>       unsigned int width = TYPE_PRECISION (arg1_type);
>>>
>>> it seems that the check on bitsize should really be a check on the
>>> precision of the variable. If this seems right, I will correct this
>>> on the trunk and make the appropriate changes to the wide-int branch.
>> Do you mean
>>
>>   && width == GET_MODE_PRECISION (TYPE_MODE (arg1_type))
>>
>> instead? If so, that would probably make sense, but there are a few
>> other places with the same TYPE_PRECISION/GET_MODE_BITSIZE check, in
>> particular the very similar transformation done in
>> fold_single_bit_test_into_sign_test.
> Yes. I understand the need to do this check on the mode rather than
> the precision of the type itself. The point is that if the mode under
> this type happens to be a partial int mode, then that sign bit may not
> even be where the bitsize points to.
>
> However, having just done a few greps, it looks like this case was
> just the one that I found while doing the wide-int work; there may be
> several more of these cases. Just in fold-const, there are a couple
> in fold_binary_loc. The one in tree.c:int_fits_type_p looks
> particularly wrong. I think that there are also several in
> tree-vect-patterns.c.
>
> Kenny

Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c	(revision 204842)
+++ gcc/fold-const.c	(working copy)
@@ -6593,7 +6593,7 @@ fold_single_bit_test_into_sign_test (loc
       /* This is only a win if casting to a signed type is cheap,
	  i.e. when arg00's type is not a partial mode.  */
       && TYPE_PRECISION (TREE_TYPE (arg00))
-	  == GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (arg00)))
+	  == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (arg00)))
     {
       tree stype = signed_type_for (TREE_TYPE (arg00));
       return fold_build2_loc (loc, code == EQ_EXPR ? GE_EXPR : LT_EXPR,
@@ -12050,7 +12050,7 @@ fold_binary_loc (location_t loc,
	    zerobits = ((((unsigned HOST_WIDE_INT) 1) << shiftc) - 1);
	  else if (TREE_CODE (arg0) == RSHIFT_EXPR
		   && TYPE_PRECISION (TREE_TYPE (arg0))
-		      == GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (arg0)))
+		      == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (arg0)))
	    {
	      prec = TYPE_PRECISION (TREE_TYPE (arg0));
	      tree arg00 = TREE_OPERAND (arg0, 0);
@@ -12061,7 +12061,7 @@ fold_binary_loc (location_t loc,
		{
		  tree inner_type = TREE_TYPE (TREE_OPERAND (arg00, 0));
		  if (TYPE_PRECISION (inner_type)
-		      == GET_MODE_BITSIZE (TYPE_MODE (inner_type))
+		      == GET_MODE_PRECISION (TYPE_MODE (inner_type))
		      && TYPE_PRECISION (inner_type) < prec)
		    {
		      prec = TYPE_PRECISION (inner_type);
@@ -13816,7 +13816,7 @@ fold_binary_loc (location_t loc,
	     associated with the mode of arg1, so the sign bit is
	     specified by this mode.  Check that arg1 is the signed
	     max associated with this sign bit.  */
-	  && width == GET_MODE_BITSIZE (TYPE_MODE (arg1_type))
+	  && width == GET_MODE_PRECISION (TYPE_MODE (arg1_type))
	  /* signed_type does not work on pointer types.  */
	  && INTEGRAL_TYPE_P (arg1_type))
	{
Index: gcc/tree.c
===================================================================
--- gcc/tree.c	(revision 204842)
+++ gcc/tree.c	(working copy)
@@ -8614,7 +8614,7 @@ retry:
   /* Third, unsigned integers with top bit set never fit signed types.  */
   if (! TYPE_UNSIGNED (type) && unsc)
     {
-      int prec = GET_MODE
Re: suspect code in fold-const.c
Committed as revision 204987.

Thanks,

kenny

On 11/18/2013 05:38 AM, Richard Biener wrote:
> On Fri, 15 Nov 2013, Kenneth Zadeck wrote:
>> This patch fixes a number of places where the mode bitsize had been
>> used but the mode precision should have been used. The tree level is
>> somewhat sloppy about this - some places use the mode precision and
>> some use the mode bitsize. It seems that the mode precision is the
>> proper choice, since it does the correct thing if the underlying mode
>> is a partial int mode.
>>
>> This code has been tested on x86-64 with no regressions. Ok to
>> commit?
>
> Ok.
>
> Thanks,
> Richard.
>
>> 2013-11-15  Kenneth Zadeck
>>
>> 	* tree.c (int_fits_type_p): Change GET_MODE_BITSIZE to
>> 	GET_MODE_PRECISION.
>> 	* fold-const.c (fold_single_bit_test_into_sign_test,
>> 	fold_binary_loc): Change GET_MODE_BITSIZE to
>> 	GET_MODE_PRECISION.
>>
>> Kenny
>>
>> On 11/15/2013 08:32 AM, Kenneth Zadeck wrote:
>>> On 11/15/2013 04:07 AM, Eric Botcazou wrote:
>>>>> this code from fold-const.c starts on line 13811.
>>>>>
>>>>>       else if (TREE_INT_CST_HIGH (arg1) == signed_max_hi
>>>>>                && TREE_INT_CST_LOW (arg1) == signed_max_lo
>>>>>                && TYPE_UNSIGNED (arg1_type)
>>>>>                /* We will flip the signedness of the comparison
>>>>>                   operator associated with the mode of arg1, so
>>>>>                   the sign bit is specified by this mode.  Check
>>>>>                   that arg1 is the signed max associated with this
>>>>>                   sign bit.  */
>>>>>                && width == GET_MODE_BITSIZE (TYPE_MODE (arg1_type))
>>>>>                /* signed_type does not work on pointer types.  */
>>>>>                && INTEGRAL_TYPE_P (arg1_type))
>>>>>
>>>>> with width defined as:
>>>>>
>>>>>       unsigned int width = TYPE_PRECISION (arg1_type);
>>>>>
>>>>> it seems that the check on bitsize should really be a check on the
>>>>> precision of the variable. If this seems right, I will correct
>>>>> this on the trunk and make the appropriate changes to the wide-int
>>>>> branch.
>>>> Do you mean
>>>>
>>>>   && width == GET_MODE_PRECISION (TYPE_MODE (arg1_type))
>>>>
>>>> instead? If so, that would probably make sense, but there are a few
>>>> other places with the same TYPE_PRECISION/GET_MODE_BITSIZE check,
>>>> in particular the very similar transformation done in
>>>> fold_single_bit_test_into_sign_test.
>>> Yes. I understand the need to do this check on the mode rather than
>>> the precision of the type itself. The point is that if the mode
>>> under this type happens to be a partial int mode, then that sign bit
>>> may not even be where the bitsize points to.
>>>
>>> However, having just done a few greps, it looks like this case was
>>> just the one that I found while doing the wide-int work; there may
>>> be several more of these cases. Just in fold-const, there are a
>>> couple in fold_binary_loc. The one in tree.c:int_fits_type_p looks
>>> particularly wrong. I think that there are also several in
>>> tree-vect-patterns.c.
>>>
>>> Kenny

Index: gcc/tree.c
===================================================================
--- gcc/tree.c	(revision 204986)
+++ gcc/tree.c	(working copy)
@@ -8629,7 +8629,7 @@ retry:
   /* Third, unsigned integers with top bit set never fit signed types.  */
   if (! TYPE_UNSIGNED (type) && unsc)
     {
-      int prec = GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (c))) - 1;
+      int prec = GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (c))) - 1;
       if (prec < HOST_BITS_PER_WIDE_INT)
	{
	  if (((((unsigned HOST_WIDE_INT) 1) << prec) & dc.low) != 0)
Index: gcc/ChangeLog
===================================================================
--- gcc/ChangeLog	(revision 204986)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,11 @@
+2013-11-18  Kenneth Zadeck
+
+	* tree.c (int_fits_type_p): Change GET_MODE_BITSIZE to
+	GET_MODE_PRECISION.
+	* fold-const.c (fold_single_bit_test_into_sign_test)
+	(fold_binary_loc): Change GET_MODE_BITSIZE to
+	GET_MODE_PRECISION.
+
 2013-11-18  Teresa Johnson
 
 	* gcc/cfgrtl.c (cfg_layout_initialize): Assert if we
Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c	(revision 204986)
+++ gcc/fold-const.c	(working copy)
@@ -6593,7 +6593,7 @@ fold_single_bit_test_into_sign_test (loc
       /* This is only a win if casting to a signed type is cheap,
	  i.e. when arg00's type is not a partial mode.  */
       && TYPE_PRECISION (TREE_TYPE (arg00))
-	  == GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (arg00)))
+	  == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (arg00)))
     {
       tree stype = signed_type_for (TREE_TYPE (arg00));
       return fold_build2_loc (loc, code == EQ_EXPR ? GE_EXPR : LT_EXPR,
@@ -12049,7 +12049,7 @@ fold_binary_loc (location_t loc,
	    zerobits = ((((unsigned HOST_WIDE_INT) 1) << shiftc) - 1);
	  else if (TREE_CODE (arg0) == RSHIFT_EXPR
		   && TYPE_PRECISION (TREE_TYPE (arg0))
-		      == GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (arg0)))
+		      == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (arg0)))
	    {
	      prec = TYPE_PRECISION (TREE_TYPE (arg0));
pr52543
I have figured out the root cause of pr52543, but I need some advice as to how to fix it.

The bug only happens if the source or destination of the move is a hard register. lower-subreg never breaks up pseudo to pseudo moves that are larger than word mode. According to Richard Sandiford, this bug also appears on the neon, but I do not know if there is a bugzilla for it. It also appears on my private port, which is why I am interested in it.

In the particular case of pr52543 and my port, this happens because the input arguments are hard regs. The offending code is in can_decompose_p. The problem is that if the reg is a hard reg, it completely blows off all of the information that it accumulated during the first pass and unconditionally splits the register (assuming it is legal to do so).

My question for the list: what is the predicate that we want to replace the code that always decomposes hard regs (assuming it is legal)? In the case of the neon and my port, decomposing costs 4x more than using a wide move. I assume the avr is similar.

Kenny
Re: pr52543
Ian is certainly correct. I think that the question is really bigger than finding the correct line to fix.

The problem is that this code assumes that machines do not have multiword moves or multiword shifts. My machine has both, and I assume that the avr and the neon have at least multiword moves (but I do not know about the shifts). And as life moves forward, more machines will have these.

It seems like the right way to fix this is to somehow enhance the code at the beginning of decompose_multiword_subregs to ask which modes are not cheap to move or shift, and then modify the second loop to never lower for those operations for those modes. The question is: do I add two more target hooks (one for shifting and one for moves), or do I use the rtx_cost mechanism and split for anything over COSTS_N_INSNS (1) or some such?

Kenny

On 03/20/2012 01:13 AM, Ian Lance Taylor wrote:
> Kenneth Zadeck writes:
>> I have figured out the root cause of pr52543, but I need some advice
>> as to how to fix it.
>>
>> The bug only happens if the source or destination of the move is a
>> hard register. lower-subreg never breaks up pseudo to pseudo moves
>> that are larger than word mode. According to Richard Sandiford, this
>> bug also appears on the neon, but I do not know if there is a bugzilla
>> for it. It also appears on my private port, which is why I am
>> interested in it.
>>
>> In the particular case of pr52543 and my port, this happens because
>> the input arguments are hard regs. The offending code is in
>> can_decompose_p. The problem is that if the reg is a hard reg, it
>> completely blows off all of the information that it accumulated during
>> the first pass and unconditionally splits the register (assuming it is
>> legal to do so).
>>
>> My question for the list: what is the predicate that we want to
>> replace the code that always decomposes hard regs (assuming it is
>> legal)? In the case of the neon and my port, decomposing costs 4x
>> more than using a wide move. I assume the avr is similar.
>
> I don't think can_decompose_p would be the right thing to change. If
> that function returns false, resolve_simple_move is still going to
> split up the move. You need to change resolve_simple_move.
>
> In fact, looking at resolve_simple_move, I don't think it will break up
> the register unless it has already decided to do so:
>
>   /* If we didn't have any big SUBREGS of decomposed registers, and
>      neither side of the move is a register we are decomposing, then
>      we don't have to do anything here.  */
>
>   if (src == SET_SRC (set)
>       && dest == SET_DEST (set)
>       && !resolve_reg_p (src)
>       && !resolve_subreg_p (src)
>       && !resolve_reg_p (dest)
>       && !resolve_subreg_p (dest))
>     {
>       end_sequence ();
>       return insn;
>     }
>
> So I think you need to analyze this a bit more. I don't think that is
> the offending code.
>
> Ian
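One way to phrase the rtx_cost flavor of that question (a sketch against the 4.7-era interfaces; the predicate and its use are hypothetical, not the actual lower-subreg.c change):

    /* Return true if a whole-MODE register move is as cheap as a
       single word-mode insn, in which case lower-subreg should leave
       moves of MODE alone.  */
    static bool
    cheap_multiword_move_p (enum machine_mode mode, bool speed_p)
    {
      rtx reg = gen_raw_REG (mode, LAST_VIRTUAL_REGISTER + 1);
      return set_src_cost (reg, speed_p) <= COSTS_N_INSNS (1);
    }

A shift predicate would have to build the corresponding ASHIFT rtx and cost that instead, which is roughly what makes a single hook awkward.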
Re: pr52543
Actually, I care about all registers, not just the hard ones. And as it turns out I had been wrong: lower-subreg does split pseudo-to-pseudo moves, and hard-reg-to-and-from-pseudo moves. register_move_cost requires the reg classes, which the pseudos have not been assigned yet. In any case, that is not the right thing to do for the shifts.

Kenny

On 03/20/2012 09:40 AM, Ian Lance Taylor wrote:
> Kenneth Zadeck writes:
>
>> I think that the question is really bigger than finding the correct
>> line to fix.  [...]
>
> Why not use REGISTER_MOVE_COST?  You only care about hard registers,
> and all that matters are moves between hard registers and pseudo-regs.
> So if you find a hard register REG, and if
>
>   register_move_cost (GET_MODE (reg),
>                       REGNO_REG_CLASS (REGNO (REG)),
>                       REGNO_REG_CLASS (REGNO (REG))) == 2
>
> then put the pseudo reg into non_decomposable_context.  This would be
> in find_decomposable_subregs.
>
> Ian
confusion about fma description in section 16.9 of gccint doc.
Section 16.9 of the current gccint doc reads as follows. It implies that the fma pattern should/could be used on a machine that double rounds the multiply-add:

  `fmam4'
    Multiply operand 2 and operand 1, then add operand 3, storing the
    result in operand 0.  All operands must have mode m.  This pattern
    is used to implement the fma, fmaf, and fmal builtin functions from
    the ISO C99 standard.  The fma operation may produce different
    results than doing the multiply followed by the add if the machine
    does not perform a rounding step between the operations.

So this gccint doc says that this pattern is to be used to implement the fma* builtin functions of ISO C99. However, the ISO C99 standard states:

  7.12.13 Floating multiply-add
  7.12.13.1 The fma functions

  Synopsis
  1  #include <math.h>
     double fma(double x, double y, double z);
     float fmaf(float x, float y, float z);
     long double fmal(long double x, long double y, long double z);

  Description
  2  The fma functions compute (x × y) + z, rounded as one ternary
     operation: they compute the value (as if) to infinite precision
     and round once to the result format, according to the rounding
     mode characterized by the value of FLT_ROUNDS.

  Returns
  3  The fma functions return (x × y) + z, rounded as one ternary
     operation.

Point 2 of the standard states that these functions do only a single rounding. My guess is that the gccint doc is wrong. (I should point out that the gccint doc for the new fma rtl operator in section 10.9 does require only a single rounding, but that is not the question here.)

Should I change the section 16.9 doc to indicate that this pattern is only to be used if the machine can do this with a single rounding?
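For what it is worth, here is a small example (mine, not from the doc) where the single rounding is observable; compile with -ffp-contract=off so the compiler does not itself fuse a * a + b into an fma:

#include <math.h>
#include <float.h>
#include <stdio.h>

int
main (void)
{
  double e = DBL_EPSILON;          /* 2^-52 */
  double a = 1.0 + e;
  double b = -(1.0 + 2.0 * e);

  /* a * a is exactly 1 + 2e + e^2.  Rounding that product to double
     drops the e^2 term, so multiply-then-add returns 0.  */
  double two_roundings = a * a + b;

  /* A true fma carries e^2 through to the single final rounding.  */
  double one_rounding = fma (a, a, b);

  /* Prints 0 and 4.93038e-32 (= 2^-104) with a correct fma.  */
  printf ("%g %g\n", two_roundings, one_rounding);
  return 0;
}

kenny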
Re: confusion about fma description in section 16.9 of gccint doc.
Committed in revision 187494. Thanks.

On 05/14/2012 08:05 PM, Ian Lance Taylor wrote:
> Kenneth Zadeck writes:
>
>> Should I change the section 16.9 doc to indicate that this pattern is
>> only to be used if the machine can do this with a single rounding?
>
> Sure.
>
> Ian
Re: ARM/getting rid of superfluous zero extension
David and I have been talking about this for some time. What is needed is a real global optimization algorithm. My leaning is to do it at the rtl level, because that is where everything has been exposed, but it would be a lot easier in ssa form.

The first step, in my opinion, is to ask the backend what kinds of operations are expensive and what kinds are cheap. For instance, some risc machines can do comparisons in any mode, and some can only do them at the width of a register; some machines also have different costs for sign and zero extension. You not only need to connect up the sources and the sinks, you have to understand which kinds of transformations are going to be cheapest. People do not seem to ask this kind of question at the tree level, which is one of the reasons I lean toward doing it at the rtl level.

But the chain that goes from a load (or some other operation that has a specific width output) to the sink that has a specific width input can be a long one, and only something equivalent to ssa or dataflow is going to let you connect the true sources with the true sinks. This is the problem with Tom's patch. One of the examples I have seen is a loop that had a DImode add of an index variable and an SImode comparison to exit the loop. Later passes removed the SImode comparison because on the ppc it used the loop count instruction, but the add of the index variable with sign extension remained, because it was an index variable and so the value fed back into itself. You are never going to clean this up unless you do something global, so that you can break the recurrence and see that there is no sink in that web that really needs the value to be SImode. (A small example of the shape of such a loop follows the quoted message below.)

Kenny

On 10/04/2012 06:16 PM, David Edelsohn wrote:
> On Thu, Oct 4, 2012 at 2:18 PM, Eric Botcazou wrote:
>>> Any suggestion about how I could avoid generating this
>>> zero_extension?
>>
>> Redundant extensions have been a hot topic for some time.  The
>> combiner should catch the local easy cases, we have ree.c for the
>> nonlocal easy cases and Tom recently posted:
>>   http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00408.html
>> which should catch more complex cases.  I guess someone should gather
>> the various missed optimization cases, draw a global picture of the
>> situation and see how the various approaches could be fitted
>> together.
>
> Kenny already did that as part of the thread with Tom.  Tom's patch
> also is incomplete.
>
> - David
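As an illustration of the recurrence described above (my example, not one taken from the original thread):

void
clear_array (int *p, unsigned int n)
{
  unsigned int i;

  /* On a 64-bit target each use of i as an index wants the DImode
     zero-extended value, but i also feeds its own increment, so the
     SImode value recurs around the loop's back edge.  No local pass
     can widen one def/use pair in isolation; only a global view of
     the web can show that no sink needs more than 32 bits.  */
  for (i = 0; i < n; i++)
    p[i] = 0;
}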
possible typo in gcc/java/expr.c at line 1053
This code looks bogus; I think that the "== INTEGER_CST" needs to disappear. host_integerp returns 0 or 1, not a tree code, so the comparison is never true and the -1 arm is always taken.

kenny

tree
build_newarray (int atype_value, tree length)
{
  tree type_arg;

  tree prim_type = decode_newarray_type (atype_value);
  tree type
    = build_java_array_type (prim_type,
                             host_integerp (length, 0) == INTEGER_CST
                             ? tree_low_cst (length, 0) : -1);
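Presumably the intent was just to test host_integerp's return value, i.e. something like this (my reading of the intent, not a committed fix):

  tree type
    = build_java_array_type (prim_type,
                             host_integerp (length, 0)
                             ? tree_low_cst (length, 0) : -1);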
Re: GCC 4.8.0 Status Report (2012-10-29), Stage 1 to end soon
Jakub,

I am hoping to get the rest of my wide integer conversion posted by Nov 5. I am under some adverse conditions here: hurricane Sandy hit here pretty badly. My house is hooked up to a small generator, and no one has any power for miles around.

So far richi has promised to review the patches. He has sent some comments, but so far no reviews. Some time after I get the first round of them posted, I will do a second round that incorporates everyone's comments. But I would like a little slack here if possible.

While this work is a show stopper for my private port, the patches address serious problems for many of the public ports, especially ones that have very flexible vector units. I believe that there is a significant set of latent problems currently in the existing ports that use TImode, and that these patches will fix them.

However, I will do everything in my power to get the first round of the patches posted by the Nov 5 deadline.

kenny

On 10/29/2012 01:56 PM, Jakub Jelinek wrote:
> Status
> ======
>
> I'd like to close the stage 1 phase of GCC 4.8 development on Monday,
> November 5th.  If you still have patches for new features you'd like
> to see in GCC 4.8, please post them for review soon.  Patches posted
> before the freeze, but reviewed shortly after the freeze, may still go
> in; further changes should be just bugfixes and documentation fixes.
>
> Quality Data
> ============
>
> Priority        #      Change from Last Report
> --------       ---     -----------------------
> P1              23     + 23
> P2              77     +  8
> P3              85     + 84
> --------       ---     -----------------------
> Total          185     +115
>
> Previous Report
> ===============
>
> http://gcc.gnu.org/ml/gcc/2012-03/msg00011.html
>
> The next report will be sent by me again, announcing the end of
> stage 1.
Re: GCC 4.8.0 Status Report (2012-10-29), Stage 1 to end soon
Richi,

Let me explain to you what a broken api is. I have spent the last week screwing around with tree-vrp, and as of last night I finally got it to work. In tree-vrp it is clear that double-int is the precise definition of a broken api.

Tree-vrp uses an infinite-precision view of arithmetic. However, that infinite precision is implemented on top of a finite, CARVED IN STONE, base that is, and without a patch like this always will be, 128 bits on an x86-64. But as was pointed out earlier, tree-vrp needs 2 * the size of a type + 1 bits to work correctly, and until yesterday I did not fully understand the significance of that 1 bit. What this means is that tree-vrp does not work on an x86-64 with __int128 variables. There are no checks in tree-vrp to back off when it sees something too large; tree-vrp simply gets the wrong answer. To me, this is a broken api and GCC at its very worst. The patches that required this SHOULD NEVER HAVE GONE INTO GCC.

What you have with my patches is someone who is willing to fix a large and complex problem that should have been fixed years ago. I understand that you do not like several aspects of the wide-int api, and I am willing to make some of those improvements. However, what I am worried about is that you are in some ways really attached to the style of programming where everything is dependent on the size of a HWI. I will continue to push back on those comments, but I have been working the rest in as I go along.

To answer your other question: it will be a significant problem if I cannot get these patches in. They are very prone to patch rot, and my customer wants a product without many patches to the base code. Also, I fear that the real reason you want to wait is that you really do not like the fact that these patches get rid of double-int and that style of programming, and putting off that day serves no one well.

kenny

On 10/31/2012 05:59 AM, Richard Sandiford wrote:
> Richard Biener writes:
>
>> On Tue, Oct 30, 2012 at 10:05 PM, Kenneth Zadeck wrote:
>>> jakub, i am hoping to get the rest of my wide integer conversion
>>> posted by nov 5.  [...]
>>
>> I suppose you are not going to merge your private port for 4.8 and
>> thus the wide-int changes are not a show-stopper for you.
>>
>> That said, I considered the main conversion to be appropriate to be
>> defered for the next stage1.  There is no advantage in disrupting the
>> tree more at this stage.
>
> I would like the wide_int class and rtl stuff to go in 4.8 though.
> IMO it's a significant improvement in its own right, and Kenny
> submitted it well before the deadline.
>
> Richard
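To spell out the arithmetic behind that extra bit (my arithmetic, not something stated in the thread): the worst case for range math on an n-bit type is the unsigned product

  (2^n - 1) * (2^n - 1) = 2^(2n) - 2^(n+1) + 1,

which needs 2n bits as an unsigned value, plus one more bit so that the same representation can also carry a sign: 2n + 1 bits in total. For __int128 that is 257 bits, while double-int on x86-64 holds exactly 128.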
Re: GCC 4.8.0 Status Report (2012-10-29), Stage 1 to end soon
Jakub,

My port has 256 bit integers. They are done by strapping together all of the elements of a vector unit. If one looks at where Intel is going, they are doing exactly the same thing; the difference is that they like to add the operations one at a time rather than do a clean implementation as we did. They will get there soon; it is just a matter of time.

I understand the tree-vrp code well enough to say that this operation does not work if you have TImode, but I do not know how to translate that back into C to generate a test case. My patch to tree-vrp is adaptable, in that it looks at the types in the program and adjusts its definition of infinite precision based on the code that it sees. I can point people to that code in tree-vrp, and am happy to do so, but that is not my priority now.

Also, richi pointed out that there are places in the tree level constant propagators that require infinite precision, so he is really the person who should both know about this and generate proper tests.

kenny

On 10/31/2012 09:55 AM, Jakub Jelinek wrote:
> On Wed, Oct 31, 2012 at 09:44:50AM -0400, Kenneth Zadeck wrote:
>> Tree-vrp uses an infinite-precision view of arithmetic.  However,
>> that infinite precision is implemented on top of a finite, CARVED IN
>> STONE, base that is, and without a patch like this always will be,
>> 128 bits on an x86-64.  [...]  What this means is that tree-vrp does
>> not work on an x86-64 with __int128 variables.
>
> If you see a VRP bug, please file a PR with a testcase, or point to an
> existing PR.  I agree with richi that it would be better to add a
> clean wide_int implementation for 4.9, rather than rushing something
> in, introducing lots of bugs, just for a port that hasn't been
> submitted.  Nor do I understand why > int128_t integer types are so
> crucial to your port; the vector support doesn't generally need very
> large integers, even if your vector unit is 256-bit, 512-bit or
> larger.
>
> Jakub
Re: GCC 4.8.0 Status Report (2012-10-29), Stage 1 to end soon
I was not planning to do that mangling for 4.8.

My primary justification for getting this in publicly now is that there are a large number of places where the current compiler (both at the tree and rtl levels) does not optimize when the value is larger than a single HWI. My code generalizes all of these places so that they do the transformations independent of the size of the HWI. (In some cases at the rtl level, the transformations were only done on 32 bit or smaller types, but I have seen nothing like that at the tree level.) This provides benefits for cross compilers and for ports that support TImode now.

The fact that I have chosen to do it in such a way that we will never have this problem again is the part of the patch that richi seems to object to.

We have patches that do the mangling for 256 bit integers for the front ends, but we figured that we would post those for comments. Those are likely to be controversial, because they require extensions to the syntax to accept large constants. But there is no reason why the patches that fix the existing problems in a general way should not be considered for this release.

Kenny

On 10/31/2012 10:27 AM, Jakub Jelinek wrote:
> On Wed, Oct 31, 2012 at 10:04:58AM -0400, Kenneth Zadeck wrote:
>> If one looks at where Intel is going, they are doing exactly the same
>> thing; the difference is that they like to add the operations one at
>> a time rather than do a clean implementation as we did.  [...]
>
> All I see on Intel is whole vector register shifts (and, like on many
> other ports, and/or/xor/andn could be considered whole register too).
> And, even if your port has 256-bit integer arithmetics, there is no
> mangling for __int256_t or similar, so I don't see how you can offer
> such a data type as supported in the 4.8 timeframe.
>
> Jakub
Re: GCC 4.8.0 Status Report (2012-10-29), Stage 1 to end soon
Jakub,

It is hard from all of the threads to actually distill what the real issues are here, so let me start from a clean slate and state them simply. Richi has three primary objections:

1) that we can do all of this with a templated version of double-int.
2) that we should not be passing a precision and bitsize into the interface.
3) that the interface is too large.

I have attached a fragment of my patch #5 to illustrate the main thrust of my patches and their usefulness to gcc right now.

In the current trunk, we have code that does the simplification when the mode fits in an HWI, and we have code that does the simplification when the mode fits in two HWIs. If the mode does not fit in two HWIs, the code does not do the simplification. Thus, here and in a large number of other places, we have two copies of the code. Richi wants there to be multiple template instantiations of double-int. This means that we are now going to have three copies of this code to support OImode on a 64 bit host, and four copies on a 32 bit host.

Further, note that there are not as many cases handled in the two-HWI code as in the one-HWI code, and in general this is true throughout the compiler. (CLRSB is missing from the two-HWI case in the patch.) We really did not write twice the code when we started supporting two HWIs; we added about 1.5 times the code (simplify-rtx is better than most of the rest of the compiler). I am using the rtl level as an example here because I have posted all of those patches, but the tree level is no better. I do not want to write this code a third time, and certainly not a fourth time.

Just fixing all of this is quite useful now: it fills in a lot of gaps in our transformations, and it removes many edge case crashes, because TImode really is lightly tested. However, this patch becomes crucial as the world gets larger.

Richi's second point is that we should be doing everything at "infinite precision" and not passing an explicit bitsize and precision into the interface. That works ok (sans the issues I raised with it in tree-vrp earlier) when the largest precision on the machine fits in a couple of HWIs. However, for targets that have large integers, or for cross compilers, this becomes expensive. The idea behind my set of patches is that, for the transformations that can work this way, we do the math in the precision of the type or mode. In general this means that almost all of the math will be done quickly, even on targets that support really big integers. For passes like tree-vrp, the math will be done at some multiple of the largest type seen in the actual program. The amount of the multiple is a function of the optimization, not of the target or the host. Currently (on my home computer) the wide-int interface allows the optimization to go to 4x the largest mode on the target. I can get rid of this bound at the expense of doing an alloca rather than stack-allocating a fixed-size structure. However, given the extremely heavy use of this interface, that does not seem like the best of tradeoffs.

The truth is that the vast majority of the compiler actually wants to see the math done the way that it is going to be done on the machine. Tree-vrp and the gimple constant prop do not, but I have made accommodations to handle both needs. I believe that the reason double-int was never used at the rtl level is that it does not actually do the math in a way that is useful to the target.

Richi's third objection is that the interface is too large. I disagree. It was designed based on the actual usage of the interface.
When I found places where I was writing the same code over and over again, I put it in a function as part of the interface. I later went back and optimized many of these, because this is a very heavily used interface.

Richi has many other objections, but I have agreed to fix almost all of them, so I am not going to address them here.

It really will be a huge burden to have to carry these patches until the next revision. We are currently in stage 1, and I believe that the minor issues richi raises can be easily addressed.

kenny

@@ -1373,302 +1411,87 @@ simplify_const_unary_operation (enum rtx_code code, enum machine_mode mode,
       return CONST_DOUBLE_FROM_REAL_VALUE (d, mode);
     }

-  if (CONST_INT_P (op)
-      && width <= HOST_BITS_PER_WIDE_INT && width > 0)
+  if (CONST_SCALAR_INT_P (op) && width > 0)
     {
-      HOST_WIDE_INT arg0 = INTVAL (op);
-      HOST_WIDE_INT val;
+      wide_int result;
+      enum machine_mode imode = op_mode == VOIDmode ? mode : op_mode;
+      wide_int op0 = wide_int::from_rtx (op, imode);
+
+#if TARGET_SUPPORTS_WIDE_INT == 0
+      /* This assert keeps the simplification from producing a result
+         that cannot be represented in a CONST_DOUBLE but a lot of
+         upstream callers expect that this function never fails to
+         simplify something and so you if you a