GCC driver to "Compile twice, score the assembly, choose the best"?
Hi, fellow GCC developers!

I was wondering if the "gcc" driver could be made to invoke "cc1" twice, with different flags, and then just keep the better of the two .s files that come out?

I'm sure this is not a new idea, but I'm not aware of anything being done in this area, so I've made this post to gather your views. :)

The kinds of flags I am thinking could be toggled are the register allocation and instruction scheduling ones, since it's very hard to find one-size-fits-all settings there, and we don't really want the user to have to depend on knowing the right one.

Obviously, compilation time will go up, but the run-time benefits could be huge.

What are your thoughts? What work in this area have I failed to dig up in my limited research?

Many thanks,
Ian
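A back-of-the-envelope sketch of what such a driver wrapper might do. This is purely illustrative: the mnemonic weights and the idea of scoring a .s file by summed static instruction cost are my own invention for this post, not an existing GCC feature.

```python
import re

# Invented, crude static costs: every instruction counts 1, memory
# operations a bit more, since spills usually show up as extra
# loads/stores in the assembly.
COSTS = {"ldr": 3, "str": 3, "ldp": 3, "stp": 3}

def score_asm(asm_text):
    """Return an estimated cost for an assembly listing (lower is better)."""
    total = 0
    for line in asm_text.splitlines():
        line = line.strip()
        # Skip blank lines, labels and assembler directives.
        if not line or line.startswith(".") or line.endswith(":"):
            continue
        mnemonic = re.split(r"\s", line)[0]
        total += COSTS.get(mnemonic, 1)
    return total

def choose_better(asm_a, asm_b):
    """Keep whichever of two listings scores lower."""
    return asm_a if score_asm(asm_a) <= score_asm(asm_b) else asm_b
```

A real driver would run cc1 twice with the two flag sets, score both temporary .s files, and pass the winner on to the assembler; the hard part, as discussed below, is a scoring function that actually predicts run time.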
RE: GCC driver to "Compile twice, score the assembly, choose the best"?
Thanks for the quick response.

> On Thu, May 15, 2014 at 1:46 PM, Ian Bolton wrote:
> > Hi, fellow GCC developers!
> >
> > I was wondering if the "gcc" driver could be made to invoke
> > "cc1" twice, with different flags, and then just keep the
> > better of the two .s files that come out?
>
> I'd be interested in your .s comparison tool that decides which one
> is better!

Well, yes, that's the hard part, and it could be done a number of ways. I think it would make a good contest actually, or a summer coding project. Or at least a nice subject for brainstorming at the GNU Cauldron. :)

Cheers,
Ian
RE: Using particular register class (like floating point registers) as spill register class
> On 05/16/2014 12:05 PM, Kugan wrote:
> >
> > On 16/05/14 20:40, pins...@gmail.com wrote:
> >>
> >>> On May 16, 2014, at 3:23 AM, Kugan wrote:
> >>>
> >>> I would like to know if there is any way we can use registers from a
> >>> particular register class just as spill registers (in places where the
> >>> register allocator would normally spill to the stack and nothing more),
> >>> when it can be useful.
> >>>
> >>> In AArch64, in some cases, compiling with -mgeneral-regs-only produces
> >>> better performance compared to not using it. The difference here is that
> >>> when -mgeneral-regs-only is not used, floating point registers are also
> >>> used in register allocation. Then IRA/LRA has to move them to core
> >>> registers before performing operations, as shown below.
> >>
> >> Can you show the code with fp registers disabled? Does it use the
> >> stack to spill? Normally this is due to register-to-register class
> >> costs compared to register-to-memory move costs. Also I think it
> >> depends on the processor rather than the target. For Thunder, using
> >> the fp registers might actually be better than using the stack,
> >> depending on whether the stack was in L1.
> >
> > Not all the LDR/STR combinations match to fmov. In the testcase I have:
> >
> > aarch64-none-linux-gnu-gcc sha_dgst.c -O2 -S -mgeneral-regs-only
> > grep -c "ldr" sha_dgst.s
> > 50
> > grep -c "str" sha_dgst.s
> > 42
> > grep -c "fmov" sha_dgst.s
> > 0
> >
> > aarch64-none-linux-gnu-gcc sha_dgst.c -O2 -S
> > grep -c "ldr" sha_dgst.s
> > 42
> > grep -c "str" sha_dgst.s
> > 31
> > grep -c "fmov" sha_dgst.s
> > 105
> >
> > I am not saying that we shouldn't use floating point registers here. But
> > from the above, it seems like the register allocator is using them more
> > like core registers (even though the cost model has higher cost) and then
> > moving the values to core registers before operations.
> > If that is the case, my question is: how do we make this just a spill
> > register class, so that we replace ldr/str with an equal number of fmov
> > instructions when possible?
>
> I'm also seeing stuff like this:
>
> => 0x7fb72a0928 Thread*)+2500>:  add   x21, x4, x21, lsl #3
> => 0x7fb72a092c Thread*)+2504>:  fmov  w2, s8
> => 0x7fb72a0930 Thread*)+2508>:  str   w2, [x21,#88]
>
> I guess GCC doesn't know how to store an SImode value in an FP register
> into memory? This is 4.8.1.

Please can you try that on trunk and report back.

Thanks,
Ian
RE: soft-fp functions support without using libgcc
> On Fri, May 16, 2014 at 6:34 AM, Sheheryar Zahoor Qazi wrote:
> >
> > I am trying to provide soft-fp support to an 18-bit soft-core
> > processor architecture at my university. But the problem is that
> > libgcc has not been cross-compiled for my target architecture and some
> > functions are missing, so I cannot build libgcc. I believe soft-fp is
> > compiled in libgcc, so I am unable to invoke soft-fp functions from
> > libgcc.
> > Is it possible for me to provide soft-fp support without using libgcc?
> > How should I proceed in defining the functions? Any ideas? And does any
> > architecture provide floating point support without using libgcc?
>
> I'm sorry, I don't understand the premise of your question. It is not
> necessary to build libgcc before building libgcc. That would not make
> sense. If you have a working compiler that is missing some functions
> provided by libgcc, that should be sufficient to build libgcc.

If you replace "cross-compiled" with "ported", I think it makes sense. Can one provide soft-fp support without porting libgcc to their architecture?

Cheers,
Ian
RE: Using particular register class (like floating point registers) as spill register class
> > Please can you try that on trunk and report back.
>
> OK, this is trunk, and I'm no longer seeing that happen.
>
> However, I am seeing:
>
>    0x007fb76dc82c <+160>:  adrp  x25, 0x7fb7c8
>    0x007fb76dc830 <+164>:  add   x25, x25, #0x480
>    0x007fb76dc834 <+168>:  fmov  d8, x0
>    0x007fb76dc838 <+172>:  add   x0, x29, #0x160
>    0x007fb76dc83c <+176>:  fmov  d9, x0
>    0x007fb76dc840 <+180>:  add   x0, x29, #0xd8
>    0x007fb76dc844 <+184>:  fmov  d10, x0
>    0x007fb76dc848 <+188>:  add   x0, x29, #0xf8
>    0x007fb76dc84c <+192>:  fmov  d11, x0
>
> followed later by:
>
>    0x007fb76dd224 <+2712>: fmov  x0, d9
>    0x007fb76dd228 <+2716>: add   x6, x29, #0x118
>    0x007fb76dd22c <+2720>: str   x20, [x0,w27,sxtw #3]
>    0x007fb76dd230 <+2724>: fmov  x0, d10
>    0x007fb76dd234 <+2728>: str   w28, [x0,w27,sxtw #2]
>    0x007fb76dd238 <+2732>: fmov  x0, d11
>    0x007fb76dd23c <+2736>: str   w19, [x0,w27,sxtw #2]
>
> which seems a bit suboptimal, given that these double registers now
> have to be saved in the prologue.

Thanks for doing that. Many AArch64 improvements have gone in since 4.8 was released. I think we'd have to see the output for the whole function to determine whether that code is sane. I don't suppose the source code is shareable, or do you have a testcase for this you can share?

Cheers,
Ian
Using -save-temps and @file should also save the intermediate @file used by the driver?
Does anyone have some thoughts they'd like to share on this: "When you compile anything using @file support, the driver assumes @file (at_file_supplied is true) is allowed and may pass options to the linker via @file using a *temporary* file. When -save-temps is also used, the temporary @file passed to the linker should also be saved. Saving the temporary @file passed to the linker allows a developer to re-run just the collect2/ld command. On trunk this means that gcc/gcc.c (create_at_file) should honour the save_temps_flag, saving the temporary @file for later analysis or use." From: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44273
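For readers unfamiliar with the mechanism: an @file (response file) argument tells the driver to read additional options from the named file, and the driver itself may generate a temporary @file to pass options to the linker. A simplified model of the expansion side (not the actual gcc.c create_at_file code; the helper name and the use of shlex-style splitting are assumptions for illustration):

```python
import shlex

def expand_at_files(argv, read_file):
    """Expand @file arguments: replace each '@path' with the
    whitespace-separated options read from that file.  read_file is a
    callable mapping a path to the file's text (injected so the sketch
    is easy to test without touching the filesystem).  Nested @files
    are expanded recursively."""
    out = []
    for arg in argv:
        if arg.startswith("@"):
            inner = shlex.split(read_file(arg[1:]))
            out.extend(expand_at_files(inner, read_file))
        else:
            out.append(arg)
    return out
```

The bug report's point is that when the driver generates such a file itself and -save-temps is given, the generated file should survive so the expanded link command can be re-run by hand.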
Understanding IRA
Hi Vladimir,

I have just begun working for Icera Inc. on a private port of GCC, and my first task has been to investigate the performance impact of using the Chaitin-Briggs algorithm of IRA vs the priority algorithm that we currently have enabled in our production compiler (due to not defining IRA_COVER_CLASSES). After running various tests, I measured little performance change, except for some test cases that have regressed quite considerably. My suspicion was that these regressions were due to our architecture having some instructions that can only use a subset of the register bank, so I set about trying to understand IRA well enough to confirm this.

I believe I have now reached a point where I understand IRA enough to start thinking about how I can improve these regressed cases, but I thought it might be a good idea to formulate an email with my current understanding of IRA in it, so you could correct any errors before I continue. (I figure this exchange between us will also serve useful to other newbies in the field of GCC register allocation!)

Without further ado, here is my current understanding of IRA (numbered for ease of reference).

1. You can enable full IRA, with Chaitin-Briggs coloring, by defining IRA_COVER_CLASSES, but you can just as easily disable it again by supplying -fira-algorithm=priority at the command line. (The latter is useful to know because it means you can compare two versions without recompiling.)

2. The priority algorithm allocates in an order determined by allocno_priority_compare_func, and calls into assign_hard_reg in that order to determine which allocno gets what register (or memory).

3. The CB algorithm puts allocnos into colorable and uncolorable buckets, according to conflicts between allocnos. The more conflicts, the more chance of being put in the uncolorable bucket. Being in the uncolorable bucket increases an allocno's chance of being spilled; being in the colorable bucket guarantees that it will get a hard register.

4. When you run at -O0, the conflicts aspect of IRA is disabled (ira_conflicts_p is set to false), and this subsequently causes the fast_allocation function to be used instead of do_coloring. (This makes sense, since you cannot separate allocnos into buckets without inspecting conflicts.)

5. Assuming ira_conflicts_p is true and you are using the CB algorithm, the two buckets are sorted using bucket_allocno_compare_func, and this ordering is important, since it dictates the order in which each allocno will be pushed onto the stack and hence the order in which it will be popped (and assign_hard_reg called).

6. Allocnos from the colorable bucket always go on the stack first (in the order they were sorted), and then uncolorable ones, in an order based on which is the best candidate for spilling. Each time an allocno is added to the stack, conflicts may be reduced, and there is the possibility of moving one or more allocnos from the uncolorable bucket to the colorable one (so they end up being added to the stack sooner than the other uncolorable ones). Near the end of processing the uncolorable bucket, there comes an opportunity to move all remaining allocnos to the colorable bucket (due to few remaining conflicts), and these get put on the stack last, and hence popped first, and so they all get registers. These final uncolorables will be those that had the highest spill cost and/or the most remaining conflicts.

7. The spill cost and priority calculation for determining which uncolorable to put on the stack first is important, for two main reasons: firstly, whichever allocno is picked is a more likely candidate for spilling; secondly, it will free other allocnos up to be moved to the nicer colorable bucket, from which an allocno always receives a hard register.

If you could let me know if and where I've gone wrong with any of the above, I would be extremely grateful.
Then, once we are on the same page, I hope you could also make some suggestions as to how I might help IRA work well with our instructions that can only use a subset of the register bank. Best regards, Ian Bolton (Compiler Engineer, Icera Inc.)
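Points 3-7 above describe the classic Chaitin-style simplify/select scheme. A toy sketch of that scheme, heavily simplified (no costs, no classes, no regions; the names and data structures are invented for illustration and are not IRA's actual code):

```python
def color_graph(conflicts, num_regs):
    """Toy Chaitin-style allocator.  conflicts maps each allocno to the
    set of allocnos it conflicts with; num_regs is the number of hard
    registers.  Returns {allocno: register index or 'spill'}.  Real IRA
    sorts both buckets by cost and priority; here any low-degree node
    counts as 'colorable' and ties are broken arbitrarily."""
    remaining = {n: set(adj) for n, adj in conflicts.items()}
    stack = []
    while remaining:
        # Prefer a trivially colorable node (degree < num_regs)...
        node = next((n for n, adj in remaining.items()
                     if len(adj) < num_regs), None)
        if node is None:
            # ...otherwise push a potential spill candidate.
            node = max(remaining, key=lambda n: len(remaining[n]))
        stack.append(node)
        for adj in remaining.values():
            adj.discard(node)
        del remaining[node]
    # Select phase: pop in reverse push order; late pushes pop first.
    assignment = {}
    while stack:
        node = stack.pop()
        taken = {assignment[m] for m in conflicts[node] if m in assignment}
        free = [r for r in range(num_regs) if r not in taken]
        assignment[node] = free[0] if free else "spill"
    return assignment
```

Pushing a node removes its edges, which is exactly the mechanism in point 6 by which uncolorable allocnos can migrate to the colorable bucket as the stack grows.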
RE: Understanding IRA
Hi Jeff and Vladimir. Jeff: I'd be interested in trying the patch if you can send it my way. Vladimir: Today I have seen reasonable improvements on progressively-trimmed-down versions of a *single* test case by applying some cost adjustments in the case when an operand wants one of our bottom registers and IRA is considering giving it an alternative register or memory. I will be experimenting with different values for these two new costs and some more test cases tomorrow. I wonder if the optimal register allocator will just end up being one that tries all the algorithms and picks the best output for a given case? When it comes to optimisation, I prefer to bend the rules if at all possible! Best regards, Ian -Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: 16 October 2009 16:45 To: Vladimir Makarov Cc: Ian Bolton; gcc@gcc.gnu.org Subject: Re: Understanding IRA On 10/16/09 08:53, Vladimir Makarov wrote: > > The biggest problem of GCC RA is still reload pass. It frequently > changes decisions of IRA not in a good way. Agreed. Not only may reload make a bad choice, it's horribly unpredictable. Trivial changes often lead to drastically different reloading decisions which in turn drastically change the final output. One of my favorites right now is the round-robin selection of spill registers to encourage reload inheritance. While I certainly understand the reasoning behind the code, it's amazingly frustrating to watch reload choose the worst possible spill register simply because of the round-robin selection. I've got a little hack in the reload-v2 branch which queries IRA to mitigate this problem, but it's merely a short-term hack until I can make reload inheritance go away. jeff
RE: Understanding IRA
Hi again, Vladimir,

I am pleased to report some performance improvements after altering ira-costs.c. A key benchmark for us has improved by 5%.

Specifically, in record_reg_classes(), after alt_cost has been calculated and it is about to be applied to pp->mem_cost and pp->cost[k], I check whether this particular operand wanted one of our BOTTOM_REGS (r0-r15); if so, I further increase pp->mem_cost by an arbitrary amount, and also increase pp->cost[k] by an arbitrary amount if k does not represent the BOTTOM_REGS class. My aim here is to nudge IRA in the right direction for operands that just want BOTTOM_REGS. After experimenting with different values for my "arbitrary amounts", I discovered some that successfully made IRA more likely to give BOTTOM_REGS to those instructions/operands that want them, since any other regs and memory ended up with high enough costs for IRA to try and avoid using them.

I have included a snippet from my version of record_reg_classes() below:

  op_cost_add = alt_cost * frequency;

  /* Finally, update the costs with the information we've
     calculated about this alternative.  */
  for (i = 0; i < n_ops; i++)
    if (REG_P (ops[i])
        && REGNO (ops[i]) >= FIRST_PSEUDO_REGISTER)
      {
        struct costs *pp = op_costs[i], *qq = this_op_costs[i];
        int scale = 1 + (recog_data.operand_type[i] == OP_INOUT);

        /* If this operand really wanted a BOTTOM_REG, add an extra
           cost onto memory to nudge IRA away from putting it in
           memory.  */
        if (allocno_pref
            && allocno_pref[ALLOCNO_NUM (ira_curr_regno_allocno_map
                                         [REGNO (ops[i])])] == BOTTOM_REGS)
          {
            pp->mem_cost = MIN (pp->mem_cost,
                                (qq->mem_cost + op_cost_add
                                 + (flag_ira_preferred_register_cost_memory
                                    * frequency)) * scale);
          }
        else
          {
            pp->mem_cost = MIN (pp->mem_cost,
                                (qq->mem_cost + op_cost_add) * scale);
          }

        for (k = 0; k < cost_classes_num; k++)
          {
            /* If this operand really wanted a BOTTOM_REG, add an extra
               cost onto any register class that isn't BOTTOM_REGS to
               nudge IRA away from putting it in a hard register of
               that class.  */
            if (allocno_pref
                && allocno_pref[ALLOCNO_NUM (ira_curr_regno_allocno_map
                                             [REGNO (ops[i])])] == BOTTOM_REGS)
              {
                switch (cost_classes[k])
                  {
                  case BOTTOM_REGS:
                    op_cost_add = alt_cost * frequency;
                    break;
                  case TOP_CREGS:
                  case C_REGS:
                    op_cost_add = (alt_cost
                                   + flag_ira_preferred_register_cost_register)
                                  * frequency;
                    break;
                  default:
                    op_cost_add = alt_cost * frequency;
                    break;
                  }
              }
            pp->cost[k] = MIN (pp->cost[k],
                               (qq->cost[k] + op_cost_add) * scale);
          }
      }

So far, I have found the best value for flag_ira_preferred_register_cost_memory to be 20, and the best value for flag_ira_preferred_register_cost_register to be 6. I appreciate that these numbers do not really correlate with the other cost units, but they were the ones that made the impact.

In terms of coloring algorithms, we are still seeing better performance with the priority algorithm on our benchmarks, but the cost adjustments above improved both the priority algorithm and the CB algorithm, with ira-region=mixed and ira-region=one.
If you have any thoughts you'd like to share then I'd definitely be interested, but this post is mainly because you said in a previous email that you wanted to hear my suggestions :) Best regards, Ian >Ian Bolton wrote: >> I hope you could also make some suggestions as to how I might >> help IRA work well with our instructions that can only use a >> subset of the register bank. >> >I forgot to write: thanks, it would be interesting for me to see >your suggestions :)
RE: Understanding IRA
Hi Jeff,

From an empirical perspective, the value of your patch is hard to determine at this stage - one benchmark improved by about 0.5%, but others have marginally regressed. From an intellectual perspective, however, your patch is clearly a Good Thing.

If I am understanding correctly, your aim is to prevent cases where reload evicts a pseudo from a register that conflicts with the pseudo that we are trying to reload, thereby undoing the clever cost-based logic in IRA that gave the register to the current owner in the first place? My guess is that the performance drop could be attributed to reload being lucky in some cases, and your patch preventing that luck from happening. Whilst I'm more of a risk-taker by nature, my employer would prefer more predictable behaviour from its compiler, so we will likely commit your patch to our development branch. Thanks very much!

Is there an ETA on when reload will go away? ;-)

Best regards,
Ian

> -----Original Message-----
> From: Jeff Law [mailto:l...@redhat.com]
> Sent: 22 October 2009 23:05
> To: Ian Bolton
> Cc: Vladimir Makarov; gcc@gcc.gnu.org
> Subject: Re: Understanding IRA
>
> On 10/19/09 12:30, Ian Bolton wrote:
> > Hi Jeff and Vladimir.
> >
> > Jeff: I'd be interested in trying the patch if you can send it my way.
>
> It's nothing special.
>
> /* Return nonzero if REGNO is a particularly bad choice for reloading
>    X.  */
> static int
> ira_bad_reload_regno_1 (int regno, rtx x)
> {
>   int x_regno;
>   ira_allocno_t a;
>   enum reg_class pref;
>
>   /* We only deal with pseudo regs.  */
>   if (! x || GET_CODE (x) != REG)
>     return 0;
>
>   x_regno = REGNO (x);
>   if (x_regno < FIRST_PSEUDO_REGISTER)
>     return 0;
>
>   /* If the pseudo prefers REGNO explicitly, then do not consider
>      REGNO a bad spill choice.  */
>   pref = reg_preferred_class (x_regno);
>   if (reg_class_size[pref] == 1
>       && TEST_HARD_REG_BIT (reg_class_contents[pref], regno))
>     return 0;
>
>   /* If the pseudo conflicts with REGNO, then we consider REGNO a
>      poor choice for a reload regno.  */
>   a = ira_regno_allocno_map[x_regno];
>   if (TEST_HARD_REG_BIT (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a), regno))
>     return 1;
>
>   return 0;
> }
>
> /* Return nonzero if REGNO is a particularly bad choice for reloading
>    IN or OUT.  */
> int
> ira_bad_reload_regno (int regno, rtx in, rtx out)
> {
>   return (ira_bad_reload_regno_1 (regno, in)
>           || ira_bad_reload_regno_1 (regno, out));
> }
>
> Then change the loop in allocate_reload_reg to iterate 3 times instead
> of 2. And add this fragment
>
>   if (pass == 1
>       && ira_bad_reload_regno (regnum, rld[r].in, rld[r].out))
>     continue;
>
> to the body of the conditional starting with
>
>   if ((reload_reg_free_p ...
>
> It's really just a hack. I don't want to spend much time on that code
> as ultimately I want it all to go away.
>
> Jeff
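The effect of the patch, stripped of reload's machinery, is a multi-pass selection: honour the pseudo's explicit preference, then avoid spill registers that IRA knows conflict with the value being reloaded, and only fall back to a conflicting register when nothing else is free. A toy model of that selection order (names and structure invented for illustration; this is not the actual reload code):

```python
def choose_reload_reg(free_regs, conflict_regs, preferred=None):
    """Model of the three-pass spill-register choice: pass 0 takes the
    explicitly preferred register if it is free, pass 1 skips registers
    that conflict with the reloaded value, pass 2 takes anything free.
    Returns a register or None if nothing is available."""
    for pass_no in range(3):
        for reg in free_regs:
            if pass_no == 0 and reg != preferred:
                continue
            if pass_no == 1 and reg in conflict_regs:
                continue
            return reg
    return None
```

The extra middle pass is what stops reload from evicting a pseudo whose register assignment IRA chose deliberately, while still allowing a conflicting register as a last resort.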
RE: Understanding IRA
Dave Hudson wrote: > On Thu, 2009-11-05 at 18:05 +0000, Ian Bolton wrote: > > I think I may have made a breakthrough! > > > > As mentioned above, IRA is correctly increasing the cost for TOP_REGS > > when an allocno in region 1 is being used in one of our special > > instructions that needs BOTTOM_REGS. We can also see that IRA is > > passing that allocno cost up to the associated allocno in region 0 > and > > altering the total_cost for that allocno. > > > > However, it does not appear as though IRA is doing anything with the > > total_costs, apart from using it to determine the preferred class for > > the allocno. When the main logic of the algorithms comes into play, > > we only look at allocno_costs, which do not take into account the > > future usage of that register in the allocno in region 1. > > > > I believe that if the decisions were made based on total_costs then I > > would get a better allocation with existing IRA. Before I experiment > > with this, I was wondering what the motivation is for only looking > > at allocno_costs? > > > > BTW, I did look into using the ! and * constraints, but I don't think > > they could help. Our architecture is like Motorola 68k in that we > have > > some instructions that can use all the registers and some that can > > only use the subset (BOTTOM_REGS). The latter type can only specify > > "b" for their constraint, since they can only go in BOTTOM_REGS. The > > former type might benefit from "b,!t", so we are more likely to get > > things in BOTTOM_REGS for when they are later needed there, but the > > flip-side is that we might also fill up BOTTOM_REGS when actually > there > > was no subsequent need for the value to be in BOTTOM_REGS. I may > have > > misunderstood how all this works, but it looks like constraints will > > not help here and, in fact, the total_costs calculations that I > > mention above are precisely the way IRA passes information about > > register requirements upwards. 
> I've been working on gcc for an ISA (Ubicom32) that seems to have some
> similarities to the problem you're seeing (we have some regs that can be
> used for many things but not all) and was seeing a ton of pointless moves
> introduced by reload.
>
> I've ended up trying quite a few different strategies, including defining
> different cover classes, and the strategy we're now using is to leave the
> cover classes the same way you have them, but found that the most
> critical thing was to ensure that REGNO_REG_CLASS was returning a
> minimal class correctly. Once I had this right then IRA started to do
> the right thing most of the time (-Os still isn't great, but -O2 usually
> looks very good now). We still see some problems, but they're usually a
> result of reload messing things up or suboptimal handling of
> auto-incrementing in MEMs.

Thanks for the info, Dave.

Given that C_REGS = r0-r31, BOTTOM_REGS = r0-r15 and TOP_CREGS = r16-r31, when you say "minimal class", does that mean that I should change my macro from this:

/* Map a register number to a register class.  */
#define REGNO_REG_CLASS(REGNO) \
  (BOTTOM_C_REG_P (REGNO) ? BOTTOM_REGS : \
   (REGNO) == LINK_POINTER_REGNUM ? LINK_REGS : \
   C_REG_P (REGNO) ? C_REGS : \
   D_REG_P (REGNO) ? D_REGS : \
   CSR_REG_P (REGNO) ? CSR_REGS : \
   (unsigned)(REGNO) < FIRST_PSEUDO_REGISTER ? INTERNAL_REGS : ALL_REGS)

to this (see the C_REG_P line for the change):

/* Map a register number to a register class.  */
#define REGNO_REG_CLASS(REGNO) \
  (BOTTOM_C_REG_P (REGNO) ? BOTTOM_REGS : \
   (REGNO) == LINK_POINTER_REGNUM ? LINK_REGS : \
   C_REG_P (REGNO) ? TOP_CREGS : \
   D_REG_P (REGNO) ? D_REGS : \
   CSR_REG_P (REGNO) ? CSR_REGS : \
   (unsigned)(REGNO) < FIRST_PSEUDO_REGISTER ? INTERNAL_REGS : ALL_REGS)

I made the change and nothing noticeable happened, but maybe once I fix whatever else needs fixing then the second version of the macro will be better?
RE: Understanding IRA
> On 11/06/09 05:53, Dave Hudson wrote:
> > the most critical thing was to ensure that REGNO_REG_CLASS was
> > returning a minimal class correctly.
> I believe that's been documented as the right thing to do for about 15
> years :-)  So, yes, you definitely want REGNO_REG_CLASS to return the
> smallest class to which a register belongs.
>
> Jeff

So I should definitely change my macro then! If something isn't in BOTTOM_REGS and it is in C_REGS, then the smallest class it's in is actually TOP_CREGS.

BTW, today I have achieved some good results with existing IRA by doing the following:

1) Changed the REG_ALLOC_ORDER so that TOP_CREGS are given out before BOTTOM_REGS. My previous hacked version worked by increasing the priority of those that wanted BOTTOM_REGS, so they got first pick; this new version makes them wait their turn, but ensures those with higher priority take TOP_CREGS before BOTTOM_REGS.

The analogy, I think, is of giving out meals on an airplane. Most people will eat meat or vegetarian meals, whereas vegetarians only want vegetarian meals. My hacked version was effectively allowing the vegetarians to push to the front of the queue and get their meals; this new version works by encouraging the meat eaters to eat the meaty meals first, leaving more suitable meals for the vegetarians further down the plane.

2) I have forced propagation of allocno costs to parent regions with the following hack in find_allocno_class_costs():

  {
    /* Propagate costs to upper levels in the region tree.  */
    parent_a_num = ALLOCNO_NUM (parent_a);
    for (k = 0; k < cost_classes_num; k++)
      COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k]
        += COSTS_OF_ALLOCNO (total_costs, a_num)->cost[k];
    COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost
      += COSTS_OF_ALLOCNO (total_costs, a_num)->mem_cost;
    /* BEGIN IGB-IRA CHANGE 2 */
    /* Force total_costs to propagate upwards, by setting
       allocno_costs to be total_costs.  */
    for (k = 0; k < cost_classes_num; k++)
      COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->cost[k]
        = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k];
    COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->mem_cost
      = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost;
    /* END IGB-IRA CHANGE 2 */
  }

I don't know why propagation isn't happening normally, but this total_costs hack achieves the same thing for me at the moment. Is there any information I could provide to help someone tell me why propagation isn't happening?
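The airplane-meal analogy in point 1 is easy to model: with a greedy allocator, handing out TOP_CREGS first leaves BOTTOM_REGS free for the operands that can use nothing else. A toy sketch (register names, request format and the greedy allocator itself are invented for illustration):

```python
def greedy_assign(requests, alloc_order):
    """Toy model of REG_ALLOC_ORDER's effect.  requests is a list of
    (name, allowed_regs) pairs in allocation priority order; each
    request takes the first free register in alloc_order that it is
    allowed to use, otherwise it spills."""
    free = list(alloc_order)
    result = {}
    for name, allowed in requests:
        reg = next((r for r in free if r in allowed), None)
        result[name] = reg if reg is not None else "spill"
        if reg is not None:
            free.remove(reg)
    return result

BOTTOM = ["r0", "r1"]    # usable by every operand (the vegetarian meals)
TOP = ["r16", "r17"]     # usable only by unrestricted operands
ANY = set(BOTTOM + TOP)

# Two unrestricted pseudos with higher priority, then two pseudos that
# *must* have BOTTOM_REGS.
reqs = [("a", ANY), ("b", ANY), ("c", set(BOTTOM)), ("d", set(BOTTOM))]
```

With a bottom-first order, "a" and "b" consume r0/r1 and the restricted pseudos spill; with a top-first order everyone is satisfied, which is exactly the improvement the REG_ALLOC_ORDER change is after.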
RE: Understanding IRA
Yesterday, I wrote: > BTW, today I have achieved some good results with existing IRA by doing > the following: > > 1) Changed the REG_ALLOC_ORDER so that TOP_CREGS are given out before > BOTTOM_REGS. My previous hacked version worked by increasing the > priority of those that wanted BOTTOM_REGS, so they got first pick; this > new version makes them wait their turn, but ensures those with higher > priority take TOP_CREGS before BOTTOM_REGS. > > The analogy, I think, is of giving out meals on an airplane. Most > people will eat meat or vegetarian meals, whereas vegetarians only > want vegetarian meals. My hacked version was effectively allowing > the vegetarians to push to the front of the queue and get their meals; > this new version works by encouraging the meat eaters to eat the meaty > meals first, leaving more suitable meals for the vegetarians further > down the plane. > > 2) I have forced propagation of allocnos to parent regions with the > following hack in find_allocno_class_costs(): > > { > /* Propagate costs to upper levels in the region > tree. */ > parent_a_num = ALLOCNO_NUM (parent_a); > for (k = 0; k < cost_classes_num; k++) > COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k] > += COSTS_OF_ALLOCNO (total_costs, a_num)->cost[k]; > COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost > += COSTS_OF_ALLOCNO (total_costs, a_num)->mem_cost; > /* BEGIN IGB-IRA CHANGE 2 */ > /* Force total_costs to propagate upwards, by setting > allocno_costs to be total_costs */ > for (k = 0; k < cost_classes_num; k++) > COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->cost[k] > = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k]; > COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->mem_cost > = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost; > /* END IGB-IRA CHANGE 2 */ > } > > I don't know why propagation isn't happening normally, but > this total_costs hack achieves the same thing for me at the > moment. 
> Is there any information I could provide to help someone tell me why
> propagation isn't happening?

Good news! I have been able to remove my "total_costs" hack above by instead commenting out the following line from ira_build() in ira-build.c:

  remove_unnecessary_regions (false);

For my situation at least, this function is preventing the propagation of useful allocno information from region 1 to region 0. Without my change, the allocation for a pseudo in region 0 is not aware of how that pseudo will be used inside a loop in region 1; the real impact of this is that we then need to do a register move *inside the loop* into a valid register class for the instruction in region 1.

I do not believe this will impact anyone with a regular register set, but for any architecture where some instructions can only use a subset of the register bank, I believe that all regions are always necessary, since the cost information for each allocno is relevant and important.

I still need to do some more testing regarding this "fix", but I wanted to put my findings out there as soon as possible for comment from the experts.
RE: Understanding IRA
Vladimir Makarov wrote: > Ian Bolton wrote: > > Yesterday, I wrote: > > > >> BTW, today I have achieved some good results with existing IRA by > doing > >> the following: > >> > >> 1) Changed the REG_ALLOC_ORDER so that TOP_CREGS are given out > before > >> BOTTOM_REGS. My previous hacked version worked by increasing the > >> priority of those that wanted BOTTOM_REGS, so they got first pick; > this > >> new version makes them wait their turn, but ensures those with > higher > >> priority take TOP_CREGS before BOTTOM_REGS. > >> > >> The analogy, I think, is of giving out meals on an airplane. Most > >> people will eat meat or vegetarian meals, whereas vegetarians only > >> want vegetarian meals. My hacked version was effectively allowing > >> the vegetarians to push to the front of the queue and get their > meals; > >> this new version works by encouraging the meat eaters to eat the > meaty > >> meals first, leaving more suitable meals for the vegetarians further > >> down the plane. > >> > >> 2) I have forced propagation of allocnos to parent regions with the > >> following hack in find_allocno_class_costs(): > >> > >> { > >> /* Propagate costs to upper levels in the region > >> tree. 
*/ > >> parent_a_num = ALLOCNO_NUM (parent_a); > >> for (k = 0; k < cost_classes_num; k++) > >> COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k] > >> += COSTS_OF_ALLOCNO (total_costs, a_num)->cost[k]; > >> COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost > >> += COSTS_OF_ALLOCNO (total_costs, a_num)->mem_cost; > >> /* BEGIN IGB-IRA CHANGE 2 */ > >> /* Force total_costs to propagate upwards, by setting > >> allocno_costs to be total_costs */ > >> for (k = 0; k < cost_classes_num; k++) > >> COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->cost[k] > >> = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k]; > >> COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->mem_cost > >> = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost; > >> /* END IGB-IRA CHANGE 2 */ > >> } > >> > >> I don't know why propagation isn't happening normally, but > >> this total_costs hack achieves the same thing for me at the > >> moment. Is there any information I could provide to help > >> someone tell me why propagation isn't happening? > >> > > > > Good news! I have been able to remove my "total_costs" hack > > above by instead commenting out the following line from ira_build() > > in ira-build.c: > > > > remove_unnecessary_regions (false); > > > > For my situation at least, this function is preventing the > > propagation of useful allocno information from region 1 to > > region 0. Without my change, the allocation for a pseudo in > > region 0 is not aware of how that pseudo will be used inside > > a loop in region 1; the real impact of this is that we need > > to then do a register move *inside the loop* into a valid > > register class for the instruction in region 1. > > > > I do not believe this will impact anyone with a regular > > register set, but for any architecture where some instructions > > can only use a subset of the register bank, I believe that > > all regions are always necessary, since cost information > > for each allocno is relevant and important. 
> > > > I still need to do some more testing in regards this "fix", > > but I wanted to put my findings out there as soon as possible > > for comment from the experts. > > > The function (remove_unnecessary_regions) is used to increase RA speed > by decreasing the number of regions. A region is removed if the register > pressure in it is not high; in other words, if the probability of > spilling pseudos on the region border to improve RA inside the region is > small. > > But when a region is removed, its allocnos are removed too (see > function remove_unnecessary_allocnos), and their info, including hard > register costs, is propagated to the corresponding allocnos in the > parent region (see function propagate_some_info_from_allocno). > > I've just checked the code; it looks OK to me. Ian, it would be nice if > you could figure out why the propagation does not happen. As Jeff wrote, > fixing that would be important practically for any target. (I'm in the middle of a hefty compile at the moment, so I can
RE: Understanding IRA
Ian Bolton wrote: > Vladimir Makarov wrote: > > Ian Bolton wrote: > > > Yesterday, I wrote: > > > > > >> BTW, today I have achieved some good results with existing IRA by > > doing > > >> the following: > > >> > > >> 1) Changed the REG_ALLOC_ORDER so that TOP_CREGS are given out > > before > > >> BOTTOM_REGS. My previous hacked version worked by increasing the > > >> priority of those that wanted BOTTOM_REGS, so they got first pick; > > this > > >> new version makes them wait their turn, but ensures those with > > higher > > >> priority take TOP_CREGS before BOTTOM_REGS. > > >> > > >> The analogy, I think, is of giving out meals on an airplane. Most > > >> people will eat meat or vegetarian meals, whereas vegetarians only > > >> want vegetarian meals. My hacked version was effectively allowing > > >> the vegetarians to push to the front of the queue and get their > > meals; > > >> this new version works by encouraging the meat eaters to eat the > > meaty > > >> meals first, leaving more suitable meals for the vegetarians > further > > >> down the plane. > > >> > > >> 2) I have forced propagation of allocnos to parent regions with > the > > >> following hack in find_allocno_class_costs(): > > >> > > >> { > > >> /* Propagate costs to upper levels in the region > > >> tree. 
*/ > > >> parent_a_num = ALLOCNO_NUM (parent_a); > > >> for (k = 0; k < cost_classes_num; k++) > > >> COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k] > > >> += COSTS_OF_ALLOCNO (total_costs, a_num)->cost[k]; > > >> COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost > > >> += COSTS_OF_ALLOCNO (total_costs, a_num)->mem_cost; > > >> /* BEGIN IGB-IRA CHANGE 2 */ > > >> /* Force total_costs to propagate upwards, by setting > > >> allocno_costs to be total_costs */ > > >> for (k = 0; k < cost_classes_num; k++) > > >> COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->cost[k] > > >> = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k]; > > >> COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->mem_cost > > >> = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost; > > >> /* END IGB-IRA CHANGE 2 */ > > >> } > > >> > > >> I don't know why propagation isn't happening normally, but > > >> this total_costs hack achieves the same thing for me at the > > >> moment. Is there any information I could provide to help > > >> someone tell me why propagation isn't happening? > > >> > > > > > > Good news! I have been able to remove my "total_costs" hack > > > above by instead commenting out the following line from ira_build() > > > in ira-build.c: > > > > > > remove_unnecessary_regions (false); > > > > > > For my situation at least, this function is preventing the > > > propagation of useful allocno information from region 1 to > > > region 0. Without my change, the allocation for a pseudo in > > > region 0 is not aware of how that pseudo will be used inside > > > a loop in region 1; the real impact of this is that we need > > > to then do a register move *inside the loop* into a valid > > > register class for the instruction in region 1. 
> > > > > > I do not believe this will impact anyone with a regular > > > register set, but for any architecture where some instructions > > > can only use a subset of the register bank, I believe that > > > all regions are always necessary, since cost information > > > for each allocno is relevant and important. > > > > > > I still need to do some more testing in regards this "fix", > > > but I wanted to put my findings out there as soon as possible > > > for comment from the experts. > > > > > The function (remove_unnecessary_regions) is used to increase RA > speed > > by decreasing number of regions. The region is removed if the > register > > pressure is not high in the region in other words if the probability > to > > spill pseudos on the region border to improve RA in the region is > > small. > > > > But when the region is removed and correspondingly allocnos (see > > function remove_unecessar
RE: Understanding IRA
Ian Bolton wrote: > Ian Bolton wrote: > > Vladimir Makarov wrote: > > > Ian Bolton wrote: > > > > Yesterday, I wrote: > > > > > > > >> BTW, today I have achieved some good results with existing IRA > by > > > doing > > > >> the following: > > > >> > > > >> 1) Changed the REG_ALLOC_ORDER so that TOP_CREGS are given out > > > before > > > >> BOTTOM_REGS. My previous hacked version worked by increasing > the > > > >> priority of those that wanted BOTTOM_REGS, so they got first > pick; > > > this > > > >> new version makes them wait their turn, but ensures those with > > > higher > > > >> priority take TOP_CREGS before BOTTOM_REGS. > > > >> > > > >> The analogy, I think, is of giving out meals on an airplane. > Most > > > >> people will eat meat or vegetarian meals, whereas vegetarians > only > > > >> want vegetarian meals. My hacked version was effectively > allowing > > > >> the vegetarians to push to the front of the queue and get their > > > meals; > > > >> this new version works by encouraging the meat eaters to eat the > > > meaty > > > >> meals first, leaving more suitable meals for the vegetarians > > further > > > >> down the plane. > > > >> > > > >> 2) I have forced propagation of allocnos to parent regions with > > the > > > >> following hack in find_allocno_class_costs(): > > > >> > > > >> { > > > >> /* Propagate costs to upper levels in the region > > > >> tree. 
*/ > > > >> parent_a_num = ALLOCNO_NUM (parent_a); > > > >> for (k = 0; k < cost_classes_num; k++) > > > >> COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k] > > > >> += COSTS_OF_ALLOCNO (total_costs, a_num)->cost[k]; > > > >> COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost > > > >> += COSTS_OF_ALLOCNO (total_costs, a_num)->mem_cost; > > > >> /* BEGIN IGB-IRA CHANGE 2 */ > > > >> /* Force total_costs to propagate upwards, by setting > > > >> allocno_costs to be total_costs */ > > > >> for (k = 0; k < cost_classes_num; k++) > > > >> COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->cost[k] > > > >> = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k]; > > > >> COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->mem_cost > > > >> = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost; > > > >> /* END IGB-IRA CHANGE 2 */ > > > >> } > > > >> > > > >> I don't know why propagation isn't happening normally, but > > > >> this total_costs hack achieves the same thing for me at the > > > >> moment. Is there any information I could provide to help > > > >> someone tell me why propagation isn't happening? > > > >> > > > > > > > > Good news! I have been able to remove my "total_costs" hack > > > > above by instead commenting out the following line from > ira_build() > > > > in ira-build.c: > > > > > > > > remove_unnecessary_regions (false); > > > > > > > > For my situation at least, this function is preventing the > > > > propagation of useful allocno information from region 1 to > > > > region 0. Without my change, the allocation for a pseudo in > > > > region 0 is not aware of how that pseudo will be used inside > > > > a loop in region 1; the real impact of this is that we need > > > > to then do a register move *inside the loop* into a valid > > > > register class for the instruction in region 1. 
> > > > > > > > I do not believe this will impact anyone with a regular > > > > register set, but for any architecture where some instructions > > > > can only use a subset of the register bank, I believe that > > > > all regions are always necessary, since cost information > > > > for each allocno is relevant and important. > > > > > > > > I still need to do some more testing in regards this "fix", > > > > but I wanted to put my findings out there as soon as possible > > > > for comment from the experts. > > > > > > > T
RE: Understanding IRA
Jeff Law wrote: > On 11/16/09 10:33, Ian Bolton wrote: > > The question is: how to fix this? I have initially put the > REG_ALLOC_ORDER > > back to how it was and changed the operand constraints in our MD > file, > > so each of the "apathetic" instructions, will either take a 't' > (TOP_CREG) > > or '?b' (BOTTOM_REG). The '?' shows that this alternative is > slightly more > > costly than using 't'. On the benchmark that benefitted the most > from > > the new REG_ALLOC_ORDER, these constraints are almost achieving the > same > > thing. It is only "almost" there because I am struggling with how to > show > > two alternatives for loads and stores, which currently have an 'm' > > constraint. > > > I'm not aware of any way to describe this to IRA. In theory I guess > IRA > could be twiddled to use TARGET_ADDRESS_COST to derive some kind of > cost difference based on the registers used in the MEM, but it seems > rather hackish. I found a place in record_address_reg where I could achieve what I needed:

for (k = 0; k < cost_classes_num; k++)
  {
    i = cost_classes[k];
    pp->cost[k] += (ira_get_may_move_cost (Pmode, i, rclass, true)
                    * scale) / 2;
    /* Slightly nudge memory addresses away from using BOTTOM_REGS and
       C_REGS, so they take TOP_CREGS instead - should this pseudo later
       need BOTTOM_REGS, there will be a higher cost to use TOP_CREGS
       and it will still get BOTTOM_REGS.  This is equivalent to adding
       a '?b' on each instruction that currently has an 'm' constraint.
       Writing this generically might look something like:
         pp->cost[k] += TARGET_ADDRESS_EXTRA_COST_P (cost_classes[k])
                        ? (scale / 2) : 0;  */
    if (cost_classes[k] == BOTTOM_REGS || cost_classes[k] == C_REGS)
      pp->cost[k] += (scale) / 2;
  }

I was then able to alter all our register-agnostic instructions in our .md file to take either a 't' for TOP_CREGS or a '?b' for BOTTOM_REGS. Initial results showed that IRA was moving input arguments out of their BOTTOM_REGS (e.g.
$c1) into TOP_CREGS to do work on them, since it thought TOP_CREGS were less costly to use, despite the cost of the move instruction to get the input argument into a TOP_CREG. I wondered if this was because we add 2 to the alt_cost for each '?' and our REG_MOVE_COST is also 2, but this becomes irrelevant anyway if you do a lot of work with the input argument, since each potential use in a BOTTOM_REG incurs some kind of penalty (one per '?' seen) which will eventually persuade IRA that leaving the input argument in the BOTTOM_REG it arrived in is more costly than moving it to a TOP_CREG. I addressed this problem by splitting my register bank a little differently: instead of making a distinction between BOTTOM_REGS and TOP_CREGS, I made it so there was only a penalty if you used one of the non-argument BOTTOM_REGS (i.e. a callee-save BOTTOM_REG). This meant that IRA was happy to leave input arguments in their BOTTOM_REGS but erred towards using TOP_CREGS once the caller-save BOTTOM_REGS had run out. This was an improvement, but there was still a case where these '?' penalties were not aligned with reality:

T1 = A + B;  // can use any register, TOP_CREGS appears cheaper
T2 = A - C;  // can use any register, TOP_CREGS appears cheaper
T3 = A & D;  // must use BOTTOM_REGS

The constraints for the first two instructions show that TOP_CREGS is cheaper, but then you have to plant a move to get A into a BOTTOM_REG to do the AND; in reality, we know it is cheaper to have A in a BOTTOM_REG all along, but the '?' constraint suggests there is a cost in doing this for the ADD and SUB, and so IRA will put A in a TOP_CREG at first and incur the cost of the move, because it is still cheaper than the costs I have defined with my constraints. I don't believe there is a way to communicate a conditional cost, so I'm thinking that constraints are not the solution for me at this time. What are your thoughts? > You might try something like this: > >1.
Crank up the callee-saved register cost adjustment in > assign_hard_reg so that it's scaled based on REG_FREQ. That will > probably lead to some regressions based on my experiments. > >2. Then add a check that completely avoids the cost adjustment in > cases where we pushed a MAY_SPILL_P allocno. This was on my todo list, > but I haven't got to it yet. > > If you wanted to get fancy, you could track the maximum number of > neighbors in each class as allocnos are pushed and use that to adjust > how many registers are cost adjusted in assign_hard_reg. The idea > being > the more neighbors the allocno has, the more callee-saved registers > we
Worth balancing the tree before scheduling?
From some simple experiments (see below), it appears as though GCC aims to create a lop-sided tree when there are constants involved (func1 below), but a balanced tree when there aren't (func2 below). Our assumption is that GCC likes having constants all near to each other to aid with tree-based optimisations, but I'm fairly sure that, when it comes to scheduling, it would be better to have a balanced tree, so sched has more choices about what to schedule next? The impact of limiting sched's options can be seen if we look at the pseudo-assembly produced by GCC for our architecture:

func1:
    LSHIFT $c3,$c1,3    # tmp137, a,
    ADD    $c2,$c2,$c3  # tmp138, b, tmp137
    ADD    $c1,$c2,$c1  # , tmp138, a

We think it would be better to avoid using the temporary:

func1:
    ADD    $c2,$c1,$c2  # tmp137, a, b
    LSHIFT $c1,$c1,3    # tmp138, a,
    ADD    $c1,$c2,$c1  # , tmp137, tmp138

As it currently stands, sched doesn't have the option to do this because its input (shown in func.c.172r.asmcons below) is arranged such that the first add depends on the shift and the second add depends on the first add. If the tree were balanced, sched would have the option to do the add first. And, providing the logic was there in sched, we could make it choose to schedule such that we limit the number of temporaries used. Maybe one of the RTL passes prior to scheduling has the potential to balance the tree/RTL, but just isn't doing so on our architecture?
== func.c: --

int func1 (int a, int b)
{
  /* the original expression */
  return a + b + (a << 3);
}

int func2 (int a, int b, int c)
{
  /* the original expression */
  return a + b + (a << c);
}

==

== func.c.129t.supress_extend: --

;; Function func1 (func1)

func1 (int a, int b)
{
<bb 2>:
  return (b + (a << 3)) + a;
}

func2 (int a, int b, int c)
{
<bb 2>:
  return (b + a) + (a << c);
}

==

== func.c.172r.asmcons: --

;; Function func1 (func1)

;; Pred edge ENTRY [100.0%] (fallthru)
(note 5 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 5 3 2 func.c:2 (set (reg/v:SI 134 [ a ]) (reg:SI 1 $c1 [ a ])) 45 {*movsi} (expr_list:REG_DEAD (reg:SI 1 $c1 [ a ]) (nil)))
(note 3 2 4 2 NOTE_INSN_DELETED)
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 8 2 func.c:2 (set (reg:SI 137) (ashift:SI (reg/v:SI 134 [ a ]) (const_int 3 [0x3]))) 80 {ashlsi3} (nil))
(insn 8 7 9 2 func.c:2 (set (reg:SI 138) (plus:SI (reg:SI 2 $c2 [ b ]) (reg:SI 137))) 65 {*addsi3} (expr_list:REG_DEAD (reg:SI 137) (expr_list:REG_DEAD (reg:SI 2 $c2 [ b ]) (nil))))
(note 9 8 14 2 NOTE_INSN_DELETED)
(insn 14 9 20 2 func.c:5 (set (reg/i:SI 1 $c1) (plus:SI (reg:SI 138) (reg/v:SI 134 [ a ]))) 65 {*addsi3} (expr_list:REG_DEAD (reg:SI 138) (expr_list:REG_DEAD (reg/v:SI 134 [ a ]) (nil))))
(insn 20 14 0 2 func.c:5 (use (reg/i:SI 1 $c1)) -1 (nil))

;; Function func2 (func2)

;; Pred edge ENTRY [100.0%] (fallthru)
(note 6 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 6 3 2 func.c:8 (set (reg/v:SI 134 [ a ]) (reg:SI 1 $c1 [ a ])) 45 {*movsi} (expr_list:REG_DEAD (reg:SI 1 $c1 [ a ]) (nil)))
(note 3 2 4 2 NOTE_INSN_DELETED)
(note 4 3 5 2 NOTE_INSN_DELETED)
(note 5 4 8 2 NOTE_INSN_FUNCTION_BEG)
(insn 8 5 9 2 func.c:8 (set (reg:SI 138) (plus:SI (reg:SI 2 $c2 [ b ]) (reg/v:SI 134 [ a ]))) 65 {*addsi3} (expr_list:REG_DEAD (reg:SI 2 $c2 [ b ]) (nil)))
(insn 9 8 10 2 func.c:8 (set (reg:SI 139) (ashift:SI (reg/v:SI 134 [ a ]) (reg:SI 3 $c3 [ c ]))) 80 {ashlsi3} (expr_list:REG_DEAD (reg/v:SI 134 [ a ]) (expr_list:REG_DEAD (reg:SI 3 $c3 [ c ]) (nil))))
(note 10 9 15 2 NOTE_INSN_DELETED)
(insn 15 10 21 2 func.c:11 (set (reg/i:SI 1 $c1) (plus:SI (reg:SI 138) (reg:SI 139))) 65 {*addsi3} (expr_list:REG_DEAD (reg:SI 139) (expr_list:REG_DEAD (reg:SI 138) (nil))))
(insn 21 15 0 2 func.c:11 (use (reg/i:SI 1 $c1)) -1 (nil))

==

Cheers, Ian
RE: Worth balancing the tree before scheduling?
David Edelsohn wrote: > On Fri, Nov 20, 2009 at 10:05 AM, Ian Bolton > wrote: > > From some simple experiments (see below), it appears as though GCC > aims > > to > > create a lop-sided tree when there are constants involved (func1 > below), > > but a balanced tree when there aren't (func2 below). > > > > Our assumption is that GCC likes having constants all near to each > other > > to > > aid with tree-based optimisations, but I'm fairly sure that, when it > > comes > > to scheduling, it would be better to have a balanced tree, so sched > has > > more > > choices about what to schedule next? > > I think this would depend on the target architecture and instruction > set: CISC vs RISC, many registers vs few registers, etc. I do not > believe that GCC intentionally is trying to optimize for either, but I > do not think there is a single, right answer. > Regardless of the architecture, I can't see how an unbalanced tree would ever be a good thing. With a balanced tree, you can still choose to process it in either direction (broad versus deep) - whichever is better for your architecture - but, as far as I can see (bearing in mind that I'm very new to GCC development!), a tall lop-sided tree gives few scheduling options due to all the extra dependencies. I guess I must be missing something? Regards, Ian
RE: Understanding IRA
Jeff Law wrote: > Ian Bolton wrote: > > Initial results showed that IRA was moving input arguments out of > their > > BOTTOM_REGS (e.g. $c1) into TOP_CREGS to do work on them, since it > > thought TOP_CREGS were less costly to use, despite the cost of the > move > > instruction to get the input argument into a TOP_CREG. > > > That may indicate a cost scaling issue or more general weaknesses in > IRA's cost modeling. > > > I addressed this problem by splitting my register bank a little > > differently: instead of making a distinction between BOTTOM_REGS and > > TOP_CREGS, I made it so there was only a penalty if you used one of > the > > non-argument BOTTOM_REGS (i.e. a callee-save BOTTOM_REG). This meant > > that IRA was happy to leave input arguments in their BOTTOM_REGS but > > erred towards using TOP_CREGS once the caller-save BOTTOM_REGS had > run > > out. This was an improvement, but there was still a case where these > > '?' penalties were not aligned with reality: > > > > T1 = A + B; // can use any register, TOP_CREGS appears cheaper > > T2 = A - C; // can use any register, TOP_CREGS appears cheaper > > T3 = A& D; // must use BOTTOM_REGS > > > > The constraints for the first two instructions show that TOP_CREGS is > > cheaper, but then you have to plant a move to get A into a BOTTOM_REG > > to do the AND; in reality, we know it cheaper to have A in a > BOTTOM_REG > > all along, but the '?' constraint suggests there is a cost in doing > this > > for the ADD and SUB and so IRA will put A in a TOP_CREG at first and > > incur the cost of the move because it is still cheaper than the costs > I > > have defined in with my constraints. I don't believe there is a way > to > > communicate a conditional cost, so I'm thinking that constraints are > not > > the solution for me at this time. What are your thoughts? > > > See above. This might be a problem with scaling or weaknesses in IRA's > costing model. 
In theory IRA accumulates the cost of using each class > over the set of insns using a particular pseudo. I had an epiphany this morning and came up with an idea to achieve the lookahead I thought I needed, thereby making the costs created by '?' a lot more concrete and reliable. Firstly, I have altered the alt_cost adjustment (for '?') in ira-costs.c, so that it only happens on the second pass *and* only when the first pass has determined that this allocno never needs a BOTTOM_REG. The first condition (that it only occurs on the second pass) is there so that the preferred class calculated for an allocno is based on hard constraints, as opposed to the fuzzier constraints of '?'. Without this change, the second pass cannot use the preferred class to correctly add costs for each class that doesn't intersect with the preferred class. e.g. If an insn has an allocno as an operand that requires BOTTOM_REGS, then we want the cost for TOP_CREGS for that allocno in that operand to be higher, to show the cost of moving it into BOTTOM_REGS. But if we let the '?' constraints dictate the pref class, and this allocno appears in other insns where the '?' constraint has appeared, then TOP_CREGS may end up being the preferred class, and so this insn, which actually needs BOTTOM_REGS for its operand, will end up increasing the costs for any class that doesn't intersect with TOP_CREGS (i.e. BOTTOM_REGS). I'm thinking that this change will be generally useful. What are your thoughts? The second condition is determined by me storing a new variable in each allocno on the first pass to flag whether it ever appears as an operand that must have BOTTOM_REGS. On the second pass, I can then only penalise an allocno for using my precious BOTTOM_REGS if I have already determined that it will never need them. This change is probably too specific to our case at the moment, but it could probably be made generic if it is of use to other architectures.
These two changes together make the example I created above (where A appears in multiple operations) correctly allocate a BOTTOM_REG from the outset. I am yet to run benchmarks with this solution, but I think it will end up being similar to my alternate REG_ALLOC_ORDER, where I just gave out TOP_CREGS first. Sadly, I don't think either approach will handle the case where there is low pressure and we first grab a TOP_CREG for something like an ADD (due to our constraints telling us to), then we free up that register because it's not needed anymore, and then we have to use a different (BOTTOM_REG) reg for something like an AND; we should have used just one BOTTOM_REG all along. I guess solving this would involve bigger changes to the algorithm, so that ea
RE: Understanding IRA
Jeff Law wrote: > > > > I had an epiphany this morning and came up with an idea to achieve > the > > lookahead I thought I needed, thereby making the costs created by '?' > a > > lot more concrete and reliable. > > > > Firstly, I have altered the alt_cost adjustment (for '?') in ira- > costs.c, > > so that it only happens on the second pass *and* only when the first > pass > > has determined that this allocno never needs a BOTTOM_REG. > > > > The first condition (that it only occurs on the second pass) is there > so > > that the preferred class calculated for an allocno is based on hard > > constraints, as opposed to the fuzzier constraints of '?'. Without > this > > change, the second pass cannot use the preferred class to correctly > add > > costs for each class that doesn't intersect with the preferred class. > > > > e.g. If an insn has an allocno as an operand that requires > BOTTOM_REGS, > > then we want the cost for TOP_CREGS for that allocno in that operand > to > > be higher to show the cost of moving it into BOTTOM_REGS. But if we > let > > the '?' constraints dictate the pref class, and this allocno appears > in > > other insns where the '?' constraint has appeared, then TOP_CREGS may > > end up being the preferred class and so this insn, which actually > needs > > BOTTOM_REGS for its operand, will end increasing the costs for any > class > > that doesn't intersect with TOP_CREGS (i.e. BOTTOM_REGS). > > > > I'm thinking that this change will be generally useful. What are > your > > thoughts? > > > Step #1 seems a lot like adding '*' before the '?'. The idea behind > '*' was to make it possible to ignore a constraint modifier such as '?' > or '!' when choosing register preferences. I haven't looked at IRA's > implementation of preferencing, but I suspect it mostly lifted the old > allocator's preferencing code, so I'd expect we're still supporting > '*'. I just had a look through the code and '*' means the '?' is completely ignored during costing. 
A quick grep shows me that '?' would then only have meaning in reload.c, reload1.c and postreload.c. This is the code I have from record_reg_classes() in ira-costs.c:

case '*':
  /* Ignore the next letter for this pass.  */
  c = *++p;
  break;

The comment suggests it will be ignored in "this pass", but the code makes it get ignored in both passes of ira-costs.c. Maybe this is a bug? Perhaps it should read:

case '*':
  /* Ignore the next letter for this pass.  */
  if (!allocno_pref)
    c = *++p;
  break;

If I changed all my '?' constraints to be '*?' then the above change would achieve what my hack does. > > The second condition is determined by me storing a new variable in > each > > allocno on the first pass to flag whether it ever appears as an > operand > > that must have BOTTOM_REGS. On the second pass, I can then only > penalise > > an allocno for using my precious BOTTOM_REGS if I have already > determined > > that it will never need them. > > > > This change is probably too specific to our case at the moment, but > it > > could probably be made generic, if it has use to other architectures. > > > I think we already have code to do something like this via > CONFLICT_HARD_REGNO_COSTS. Though I must admit, I haven't entirely > wrapped my head around those costs and how they're used yet. I was hoping I would be able to avoid all the conflict-related stuff, but this may be where I have to go next, although I am considering returning to my elegant solution of just giving out TOP_CREGS first unless someone specifically needs BOTTOM_REGS. This does perform slightly worse for low-pressure cases (due to the case I mention below), but is the best solution I have come up with for the high-pressure cases.
> > Sadly, I don't think either approach will handle the case where there > is > > low pressure and we first grab a TOP_CREG for something like an ADD > (due > > to our constraints telling us to) then we free up that register cos > it's > > not needed anymore and then we have to use a different (BOTTOM_REG) > reg > > for something like an AND; we should have just used just one > BOTTOM_REG. > > > That would actually be something helped by the patch we were discussing > in a prior message. The only trick is the two allocnos would have to > be > connected by conflicting with some 3rd allocno. The 3rd allocno may > not > be a high degree, but the heuristic will still trigger and slightly > encourage the ADD and AND allocnos to share the same reg. Yeah, but how do I ensure that the one that needs the BOTTOM_REG gets colored first, thereby leaving its register free to be re-used? This little tweak only kicks in when you are coloring the first of the non-conflicting pair, right? If the agnostic allocno is colored first, it will grab the TOP_CREG, thereby causing your tweak to slightly decrem
RE: Understanding IRA (The Story so far)
For those who have been following my adventure in the world of IRA, or just want to take advantage of some of the time I've spent on this, here's a detailed recap of my exploits and my current status (which I'm very happy with): I initially thought IRA was calculating costs wrongly for our architecture and so I experimented with adding more costs in different places. This effectively raised the priority of coloring certain allocnos and it led to good performance on some benchmarks, but it was really a complete hack and not an acceptable solution, so I went back to the drawing board. After chatting with Vlad and Jeff, I realised IRA was calculating costs correctly and the issue was that the costs only reflect the preferences of each allocno, as opposed to their "agnosticism" (i.e. not caring what register they get), and so instructions that will happily take any register would take whichever one you showed them first; this meant that high priority agnostic instructions (that are colored first) were pinching BOTTOM_REGS before the lower-priority special instructions got a chance. To address this, I came up with the simple solution of changing the REG_ALLOC_ORDER, where I show TOP_CREGS (the top half of our register set) to agnostic instructions before the callee-save BOTTOM_REGS (c7-c15) by default, so they take TOP_CREGS when all other costs are equal; meanwhile, allocnos requiring BOTTOM_REGS were given them because they'd been left unallocated and their costs were lower than the TOP_CREGS. This resulted in good performance improvements on some key benchmarks, where previously agnostic instructions were unwittingly using up these special BOTTOM_REGS and leading to us allocating TOP_CREGS for operands in special instructions (which then required spills and moves to get a BOTTOM_REG in time to do the special instruction). 
The main drawback of this alternate REG_ALLOC_ORDER was that we reduce the chance to reuse a callee-save register if we pick a TOP_CREG, because a TOP_CREG cannot be used by all instructions, and I ended up getting regressions on some low-pressure benchmarks because we used more callee-save registers than previously. I then tried various other ways of achieving the performance improvements I got on the high-register-pressure benchmarks while eliminating the regressions on the low-pressure ones:

* I tried using '?' constraints on agnostic instructions, such that they erred towards TOP_CREGS instead of callee-save BOTTOM_REGS, but this led to too much bias away from the more reusable BOTTOM_REGS (even when there was no need to avoid them) and I subsequently got regressions;

* I tried restricting the effect of the '?', by not using it for the register preferencing pass and by only adding on the alt_cost if I'd determined the allocno never requires BOTTOM_REG, and this got some decent performance, but I was still effectively penalising agnostic instructions for using BOTTOM_REGS regardless of whether other allocnos were demanding them, so I got regressions;

* I tried coloring the ones that need BOTTOM_REGS first, which maximises the reuse opportunities, but this was essentially a rejection of the existing tried-and-tested coloring heuristics in IRA and consequently led to some regressions;

* I tried combining the new REG_ALLOC_ORDER with a modified coloring heuristic that uses the need for BOTTOM_REGS as a tie-breaker when two allocnos are otherwise indistinguishable, but this was overriding the ALLOCNO_NUM ordering (which I guess represents live range starting points?), so I got some regressions.

Fortunately, I was hit by another epiphany on Monday night and I realised I could emulate the new REG_ALLOC_ORDER dynamically within assign_hard_reg, and only when necessary.
When allocating for an agnostic allocno, if it conflicted with other allocnos that needed BOTTOM_REGS, I added 1 to the full_cost of each of the BOTTOM_REGS so this allocno took a TOP_CREG. This gave me my best performance overall, but there were still some regressions on low-pressure benchmarks. Whilst investigating them, I realised I had gone beyond my original new REG_ALLOC_ORDER because I was now avoiding all BOTTOM_REGS instead of just the callee-save BOTTOM_REGS, so I changed assign_hard_reg so it only adds 1 to each callee-save BOTTOM_REG. This means we happily use up the caller-save BOTTOM_REGS (c1 - c6), regardless of whether the allocno is agnostic or not, and this leads to more reusability in low-pressure cases (whereas avoiding all BOTTOM_REGS for an agnostic instruction decreases reusability). Note that I had been experimenting with Priority algorithm and Chaitin-Briggs up until this point and have decided to stick with Priority for now because we are using Priority already and this minimises the disruption and reduces the risk of regressions on untested inputs. (Results generally show CB to be better performing and smaller code-size, but it suffers badly in very high pressure cases and spills more than Priority.) A
Possible IRA improvements for irregular register architectures
Let's assume I have two sub-classes of ALL_REGS: BOTTOM_REGS (c0-c15) and TOP_CREGS (c16-c31).

Let's also assume I have two main types of instruction: A-type Instructions, which can use ALL 32 registers, and B-type Instructions, which can only use the 16 BOTTOM_REGS.

IRA will correctly calculate costs (in ira-costs.c) for allocnos appearing in B-type instructions, such that TOP_CREGS has a higher cost than BOTTOM_REGS. It will also calculate costs for the A-type instructions such that TOP_CREGS and BOTTOM_REGS have the same cost.

The order of coloring will be determined by the algorithm chosen: Priority or Chaitin-Briggs. As each allocno is colored, the costs will be inspected and the best available hard register will be chosen, mainly based on the register class costs mentioned above, so allocnos in B-type Instructions will usually be assigned a BOTTOM_REG if one is free. If two or more hard registers share the same cost then whichever one appears first in the REG_ALLOC_ORDER will be assigned. (If no hard register can be found, the allocno is assigned memory and will require a "reload" in a later pass to get a hard register.)

I do not wish to alter the coloring algorithms or the coloring order. I believe they are good at determining the order to color allocnos, which dictates the likelihood of being assigned a hard register. What I wish to change is the hard register that is assigned, given that the coloring order has determined that this allocno should get one next.

Why do I need to do this? Since the BOTTOM_REGS can be used by all instructions, it makes sense to put them first in the REG_ALLOC_ORDER, so we minimise the number of registers consumed by a low-pressure function. But it also makes sense, in high-pressure functions, to steer A-type Instructions away from using BOTTOM_REGS so that they are free for B-type Instructions to use.
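The assignment rule just described can be illustrated with a toy model (this is not IRA's actual code; the function and array names here are invented): the cheapest free hard register wins, and REG_ALLOC_ORDER breaks ties, because we scan registers in allocation order and only accept a strictly cheaper one.

```c
#include <stdbool.h>

#define NREGS 4

/* Return the chosen hard register, or -1 for "assign memory" (which
   would later require a reload to get a hard register).  Scanning in
   allocation order and accepting only strictly cheaper registers makes
   REG_ALLOC_ORDER the tie-breaker.  */
static int
pick_hard_reg (const int cost[NREGS], const bool free_p[NREGS],
               const int alloc_order[NREGS])
{
  int best = -1, best_cost = 0;
  for (int i = 0; i < NREGS; i++)
    {
      int r = alloc_order[i];
      if (free_p[r] && (best < 0 || cost[r] < best_cost))
        {
          best = r;
          best_cost = cost[r];
        }
    }
  return best;
}
```

This is exactly why the static-cost approaches below struggle: the only lever is `cost[]`, which is fixed before coloring begins.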
To achieve this, I tried altering the costs calculated in ira-costs.c, either explicitly with various hacks or by altering operand constraints. The problem with this approach was that it is static and independent, occurring before any coloring order has been determined and without any awareness of the needs of other allocnos. I believe I require a dynamic way to alter the costs, based on which allocnos conflict with the allocno currently being colored and which hard registers are still available at that point.

The patch I have attached here is my first reasonably successful attempt at this dynamic approach, which has led to performance improvements on some of our benchmarks and no significant regressions. I am hoping it will be useful to others, but I post it more as a talking point, or perhaps to inspire others to come up with better solutions and share them with me :-)

bolton_ira_subset_experiments.diff
RE: Possible IRA improvements for irregular register architectures
Happy New Year! I was hoping for some kind of response to this, but maybe I didn't give enough info? I'd appreciate some pointers on what I could do to prompt some discussion because I have some promising new ideas that lead on from what I've described below. Cheers, Ian > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of > Ian Bolton > Sent: 18 December 2009 15:34 > To: gcc@gcc.gnu.org > Subject: Possible IRA improvements for irregular register architectures > > Let's assume I have two sub-classes of ALL_REGS: BOTTOM_REGS (c0-c15) > and TOP_CREGS (c16-c31). > > Let's also assume I have two main types of instruction: A-type > Instructions, which can use ALL 32 registers, and B-type Instructions, > which can only use the 16 BOTTOM_REGS. > > IRA will correctly calculate costs (in ira-costs.c) for allocnos > appearing in B-type instructions, such that TOP_CREGS has a higher > cost than BOTTOM_REGS. It will also calculate costs for the A-type > instructions such that TOP_CREGS and BOTTOM_REGS have the same cost. > > The order of coloring will be determined by the algorithm chosen: > Priority or Chaitin-Briggs. As each allocno is colored, the costs > will be inspected and the best available hard register will be chosen, > mainly based on the register class costs mentioned above, so allocnos > in B-type Instructions will usually be assigned a BOTTOM_REG if one is > free. If two or more hard registers share the same cost then > whichever one appears first in the REG_ALLOC_ORDER will be assigned. > (If no hard register can be found, the allocno is assigned memory and > will require a "reload" in a later pass to get a hard register.) > > I do not wish to alter the coloring algorithms or the coloring order. > I believe they are good at determing the order to color allocnos, > which dictates the likelihood of being assigned a hard register. 
What > I wish to change is the hard register that is assigned, given that the > coloring order has determined that this allocno should get one next. > > Why do I need to do this? Since the BOTTOM_REGS can be used by all > instructions, it makes sense to put them first in the REG_ALLOC_ORDER, > so we minimise the number of registers consumed by a low-pressure > function. But it also makes sense, in high-pressure functions, to > steer A-type Instructions away from using BOTTOM_REGS so that they are > free for B-type Instructions to use. > > To achieve this, I tried altering the costs calculated in ira-costs.c, > either explicitly with various hacks or by altering operand > constraints. The problem with this approach was that it is static and > independent, occurring before any coloring order has been determined > and without any awareness of the needs of other allocnos. I believe I > require a dynamic way to alter the costs, based on which allocnos > conflict with the allocno currently being colored and which hard > registers are still available at this point. > > The patch I have attached here is my first reasonable successful > attempt at this dynamic approach, which has led to performance > improvements on some of our benchmarks and no significant > regressions. > > I am hoping it will be useful to others, but I post it more as a > talking point or perhaps to inspire others to come up with better > solutions and share them with me :-)
RE: Possible IRA improvements for irregular register architectures
In the absence of any discussion, I just went ahead and implemented my new ideas, which are attached as a patch!

Whilst it is tailored to the specifics of our architecture - where we have 32 C_REGS, of which there are 16 BOTTOM_REGS, and some insns that only work with BOTTOM_REGS - I believe it could be adapted to other architectures where you want to optimise the allocation of any special subsets of your register set.

The key file to look at is ira-color.c. I have something called "determine_max_abcd", which considers all conflicting allocnos for the one currently being colored, and finds the maximum depth of overlap of those yet-to-be-colored allocnos that require BOTTOM_REGS and those already-colored allocnos that have already been assigned a BOTTOM_REG. I use this to tell me whether to give out a BOTTOM_REG to an allocno that can use any of our 32 registers. If the ABCD is greater than or equal to the number of BOTTOM_REGS then I adjust the costs of the BOTTOM_REGS to nudge this allocno towards using a TOP_CREG instead. Without this intervention, a future allocno that needed a BOTTOM_REG would not have got one, and we would have needed to plant a spill and fill later to satisfy the needs of its instruction.

This improves performance on all benchmarks I have run on our architecture. I still have other ideas for making it even better, but I figured now was a good time to share what I have. Comments and questions would be greatly appreciated.

Cheers, Ian

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> Ian Bolton
> Sent: 04 January 2010 14:19
> To: gcc@gcc.gnu.org
> Subject: RE: Possible IRA improvements for irregular register
> architectures
>
> Happy New Year!
>
> I was hoping for some kind of response to this, but maybe I didn't give
> enough info? I'd appreciate some pointers on what I could do to prompt
> some discussion because I have some promising new ideas that lead on
> from what I've described below.
> > Cheers, > Ian > > > -Original Message----- > > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf > Of > > Ian Bolton > > Sent: 18 December 2009 15:34 > > To: gcc@gcc.gnu.org > > Subject: Possible IRA improvements for irregular register > architectures > > > > Let's assume I have two sub-classes of ALL_REGS: BOTTOM_REGS (c0-c15) > > and TOP_CREGS (c16-c31). > > > > Let's also assume I have two main types of instruction: A-type > > Instructions, which can use ALL 32 registers, and B-type > Instructions, > > which can only use the 16 BOTTOM_REGS. > > > > IRA will correctly calculate costs (in ira-costs.c) for allocnos > > appearing in B-type instructions, such that TOP_CREGS has a higher > > cost than BOTTOM_REGS. It will also calculate costs for the A-type > > instructions such that TOP_CREGS and BOTTOM_REGS have the same cost. > > > > The order of coloring will be determined by the algorithm chosen: > > Priority or Chaitin-Briggs. As each allocno is colored, the costs > > will be inspected and the best available hard register will be > chosen, > > mainly based on the register class costs mentioned above, so allocnos > > in B-type Instructions will usually be assigned a BOTTOM_REG if one > is > > free. If two or more hard registers share the same cost then > > whichever one appears first in the REG_ALLOC_ORDER will be assigned. > > (If no hard register can be found, the allocno is assigned memory and > > will require a "reload" in a later pass to get a hard register.) > > > > I do not wish to alter the coloring algorithms or the coloring order. > > I believe they are good at determing the order to color allocnos, > > which dictates the likelihood of being assigned a hard register. > What > > I wish to change is the hard register that is assigned, given that > the > > coloring order has determined that this allocno should get one next. > > > > Why do I need to do this? 
Since the BOTTOM_REGS can be used by all > > instructions, it makes sense to put them first in the > REG_ALLOC_ORDER, > > so we minimise the number of registers consumed by a low-pressure > > function. But it also makes sense, in high-pressure functions, to > > steer A-type Instructions away from using BOTTOM_REGS so that they > are > > free for B-type Instructions to use. > > > > To achieve this, I tried altering the costs calculated in ira- > costs.c, > > either explicitly with various hacks or by altering operand > > constraints. The problem with this approach was that it is static > and > > independent, occurring before any coloring ord
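The "determine_max_abcd" heuristic described in the mail above can be sketched roughly as follows. This is a simplified illustration with invented names: the real code measures the maximum overlap depth across live ranges, while this toy just counts conflicting allocnos that need, or already hold, a BOTTOM_REG.

```c
#include <stdbool.h>

#define NUM_BOTTOM_REGS 16

struct allocno_info
{
  bool colored;          /* already assigned a hard register?        */
  bool needs_bottom;     /* uncolored and restricted to BOTTOM_REGS  */
  bool assigned_bottom;  /* colored and holding a BOTTOM_REG         */
};

/* Approximate the "ABCD" demand: conflicting allocnos that will need,
   or already occupy, a BOTTOM_REG.  (Counting conflicts is a
   simplification of the maximum live-range overlap depth.)  */
static int
approx_bottom_demand (const struct allocno_info *conflicts, int n)
{
  int demand = 0;
  for (int i = 0; i < n; i++)
    if ((!conflicts[i].colored && conflicts[i].needs_bottom)
        || (conflicts[i].colored && conflicts[i].assigned_bottom))
      demand++;
  return demand;
}

/* Steer an "any register" allocno towards TOP_CREGS when the
   BOTTOM_REGS look fully subscribed.  */
static bool
should_prefer_top_creg (const struct allocno_info *conflicts, int n)
{
  return approx_bottom_demand (conflicts, n) >= NUM_BOTTOM_REGS;
}
```

The point of the threshold test is that the cost nudge only fires when handing out a BOTTOM_REG now would provably starve a later B-type allocno.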
Possible IRA bug in assign_hard_reg
Near the end of assign_hard_reg in ira-color.c, there is this code:

  if (min_full_cost > mem_cost)
    {
      if (! retry_p && internal_flag_ira_verbose > 3 && ira_dump_file != NULL)
        fprintf (ira_dump_file, "(memory is more profitable %d vs %d) ",
                 mem_cost, min_full_cost);
      best_hard_regno = -1;
    }

If retry_p is true then we are in reload, so I wouldn't expect us to override best_hard_regno in this case. I think the code should read:

  if (min_full_cost > mem_cost && ! retry_p)
    {
      if (internal_flag_ira_verbose > 3 && ira_dump_file != NULL)
        fprintf (ira_dump_file, "(memory is more profitable %d vs %d) ",
                 mem_cost, min_full_cost);
      best_hard_regno = -1;
    }

I'm probably wrong, but I wanted to check.

Cheers, Ian
RE: Possible IRA bug in assign_hard_reg
Thanks for the detailed answer.

While we're on the subject of assign_hard_reg, I notice the costs array and min_cost variable are set but never used (decisions are being made with the full_costs array and min_full_cost). Should they be referenced somehow, or are they just redundant?

Cheers, Ian

> -Original Message-
> From: Vladimir Makarov [mailto:vmaka...@redhat.com]
> Sent: 21 January 2010 21:08
> To: Ian Bolton
> Cc: gcc@gcc.gnu.org
> Subject: Re: Possible IRA bug in assign_hard_reg
>
> Ian Bolton wrote:
> > Near the end of assign_hard_reg in ira-color.c, there is this code:
> >
> >   if (min_full_cost > mem_cost)
> >     {
> >       if (! retry_p && internal_flag_ira_verbose > 3 && ira_dump_file != NULL)
> >         fprintf (ira_dump_file, "(memory is more profitable %d vs %d) ",
> >                  mem_cost, min_full_cost);
> >       best_hard_regno = -1;
> >     }
> >
> > If retry_p is true then we are in reload, so I wouldn't expect us to
> > override best_hard_regno in this case.  I think the code should read:
> >
> >   if (min_full_cost > mem_cost && ! retry_p)
> >     {
> >       if (internal_flag_ira_verbose > 3 && ira_dump_file != NULL)
> >         fprintf (ira_dump_file, "(memory is more profitable %d vs %d) ",
> >                  mem_cost, min_full_cost);
> >       best_hard_regno = -1;
> >     }
> >
> > I'm probably wrong, but I wanted to check.
> >
> The original code is ok, I think.
>
> First, retry_p is true not only from reload.  It is true when we are
> trying to assign hard register after IR flattening during which new
> allocnos can be created to resolve cycles for register shuffling on
> region borders.
>
> If memory is more profitable we should use it, independently from where
> we call assign_hard_reg.
>
> Retry_p is used as guard for dump printing because it is a part of dump
> when allocno is popped from coloring stack (in this case retry_p is
> false).
Question about IRA (setup_allocno_priorities)
Hi again,

Thanks for your answer to my other question. I just found a case where an allocno wasn't getting a register when I thought it should, since it was referenced 24 times. I looked in setup_allocno_priorities() to see how this was used to determine the priority and found this line:

  mult = floor_log2 (ALLOCNO_NREFS (a)) + 1;

All other things being equal, this means something that is referenced 16 times will get the same priority as something referenced 31 times. I just changed this line to remove the floor_log2 and got better results on my test case. I'm wondering what the origins of the floor_log2 are? You scale everything based on max_priority and INT_MAX later, so I don't see the harm in using nrefs as the multiplier in the first instance, and I actually think this would be more representative of the impact of an allocno not getting a register.

NOTE: I am still using the Priority algorithm because that performs better than CB on our architecture at the moment.

Cheers, Ian
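The flattening effect of that multiplier is easy to check standalone. The helper below mirrors the usual bit-length definition of floor_log2 for positive values (GCC has its own implementation; this is just for demonstration):

```c
/* floor_log2 (x) = index of the highest set bit, i.e. the largest l
   with 2^l <= x; returns -1 for x == 0.  */
static int
floor_log2 (unsigned int x)
{
  int l = -1;
  while (x)
    {
      l++;
      x >>= 1;
    }
  return l;
}
```

With mult = floor_log2 (nrefs) + 1, every nrefs from 16 through 31 (including the 24-reference allocno mentioned above) collapses to the same multiplier of 5, and only at 32 references does it rise to 6.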
RE: GCC freezing for a Multiply Chain
Hi Balaji.

If you have access to the GCC source and are willing to change it, there is a way to turn off IRA (the Integrated Register Allocator) and just use a "fast allocation" algorithm that otherwise only gets used for -O0. Should IRA be the issue, this may be a workaround for you in the short term: in GCC version 4.4.1, the last function of ira-color.c is called ira_color(). Simply make it always call fast_allocation instead of only when ira_conflicts_p is false.

Best regards, Ian

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> balaji.i...@gtri.gatech.edu
> Sent: 15 February 2010 17:04
> To: gcc@gcc.gnu.org
> Subject: GCC freezing for a Multiply Chain
>
> Hello Everyone,
> I am creating a benchmark, where I have the following code:
>
> x1 = a * b
> x2 = x1 * a;
> x3 = x1 * x2;
> x4 = x2 * x3;
> x5 = x3 * x4;
> x6 = x5 * x6;
> . . .
> . . .
> x1000 = x999 * x998;
>
> When I do this, and compile using -O3/-O2/-O1, the compiler freezes.
> Am I doing anything wrong?
>
> GCC Version: 4.2.1
> Platform: x86_64 (in macOS Snow Leopard).
>
> I also tested this on a i386 (Red Hat 4.3.2-7) machine with gcc 4.3.2
> and I had the same response.
>
> Please CC me in the response since I'm not a subscribed member of this
> mailing list.
>
> Any help is highly appreciated!
>
> Thanks,
>
> Balaji V. Iyer.
Function versioning tests?
Hi there, I've changed our private port of GCC to give versioned functions better names (rather than T.0, T.1), and was wondering if there are any existing tests that push function-versioning to the limit, so I can test whether my naming scheme is sound. Failing that, I'd appreciate some pointers on how I might make such a test. I know I need to be passing a constant in as a parameter, but I don't know what other criteria are required to make it trigger. Cheers, Ian
RE: Function versioning tests?
Hi Bingfeng. Thanks for pointing me at that patch, but it doesn't do what I require, which is probably fortunate because I would have wasted work otherwise! My change incorporates the callee function name and caller function name into the new name (e.g. bar.0.foo), so that we can trace that name back to the original non-versioned function in the source, without requiring additional debugging information; a standard build contains all the info we need because it is held in the function names. Are there any versioning tests for the patch you mention? Cheers, Ian > -Original Message- > From: Bingfeng Mei [mailto:b...@broadcom.com] > Sent: 22 February 2010 09:58 > To: Ian Bolton; gcc@gcc.gnu.org > Subject: RE: Function versioning tests? > > Hi, > GCC 4.5 already contains such patch. http://gcc.gnu.org/ml/gcc- > patches/2009-03/msg01186.html > If you are working on 4.4 branch, you can just apply the patch without > problem. > > Cheers, > Bingfeng > > > -Original Message- > > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > > Behalf Of Ian Bolton > > Sent: 19 February 2010 17:09 > > To: gcc@gcc.gnu.org > > Subject: Function versioning tests? > > > > Hi there, > > > > I've changed our private port of GCC to give versioned functions > > better names (rather than T.0, T.1), and was wondering if there > > are any existing tests that push function-versioning to the limit, > > so I can test whether my naming scheme is sound. > > > > Failing that, I'd appreciate some pointers on how I might make > > such a test. I know I need to be passing a constant in as a > > parameter, but I don't know what other criteria are required to > > make it trigger. > > > > Cheers, > > Ian > > > >
RE: Coloring problem - Pass 0 for finding allocno costs
> The problem I see is that for registers 100,101 I get best register > class D instead of R - actually they get the same cost and D is chosen > (maybe because it is first). Hi Frank. Do D and R overlap? It would be useful to know which regs are in which class, before trying to understand what is going on. Can you paste an example of your define_insn from your MD file to show how operands from D or R are both valid? I ask this because it is possible to express that D is more expensive than R with operand constraints. For general IRA info, you might like to look over my long thread on here called "Understanding IRA". Cheers, Ian
RE: Coloring problem - Pass 0 for finding allocno costs
> -Original Message- > From: Frank Isamov [mailto:frank.isa...@gmail.com] > Sent: 18 March 2010 14:29 > To: Ian Bolton > Subject: Re: Coloring problem - Pass 0 for finding allocno costs > > On Thu, Mar 18, 2010 at 3:51 PM, Ian Bolton > wrote: > >> The problem I see is that for registers 100,101 I get best register > >> class D instead of R - actually they get the same cost and D is > chosen > >> (maybe because it is first). > > > > Hi Frank. > > > > Do D and R overlap? It would be useful to know which regs are in > > which class, before trying to understand what is going on. > > > > Can you paste an example of your define_insn from your MD file to > show > > how operands from D or R are both valid? I ask this because it is > > possible to express that D is more expensive than R with operand > > constraints. > > > > For general IRA info, you might like to look over my long thread on > > here called "Understanding IRA". > > > > Cheers, > > Ian > > > > Hi Ian, > > Thank you very much for your prompt reply. > D and R are not overlap. 
Please see fragments of .md and .h files > below: > > From the md: > > (define_register_constraint "a" "R_REGS" "") > (define_register_constraint "d" "D_REGS" "") > > (define_predicate "a_operand" > (match_operand 0 "register_operand") > { > unsigned int regno; > if (GET_CODE (op) == SUBREG) > op = SUBREG_REG (op); > regno = REGNO (op); > return (regno >= FIRST_PSEUDO_REGISTER || > REGNO_REG_CLASS(regno) == R_REGS); > } > ) > > (define_predicate "d_operand" > (match_operand 0 "register_operand") > { > unsigned int regno; > if (GET_CODE (op) == SUBREG) > op = SUBREG_REG (op); > regno = REGNO (op); > return (regno >= FIRST_PSEUDO_REGISTER || > REGNO_REG_CLASS(regno) == D_REGS); > } > ) > > (define_predicate "a_d_operand" > (ior (match_operand 0 "a_operand") >(match_operand 0 "d_operand"))) > > (define_predicate "i_a_d_operand" > (ior (match_operand 0 "immediate_operand") >(match_operand 0 "a_d_operand"))) > > (define_insn "mov_regs" > [(set (match_operand:SISFM 0 "a_d_operand" "=a, a, a, d, d, d") > (match_operand:SISFM 1 "i_a_d_operand" "a, i, d, a, > i, d"))] > "" > "move\t%1, %0" > ) > > (define_insn "*addsi3_1" > [(set (match_operand:SI 0 "a_d_operand""=a, a, a,d,d") > (plus:SI (match_operand:SI 1 "a_d_operand" "%a, a, a,d,d") > (match_operand:SI 2 "nonmemory_operand" > "U05,S16,a,d,U05")))] > "" > "adda\t%2, %1, %0" > ) > > ;; Arithmetic Left and Right Shift Instructions > (define_insn "3" > [(set (match_operand:SCIM 0 "register_operand" "=a,d,d") > (sh_oprnd:SCIM (match_operand:SCIM 1 "register_operand" > "a,d,d") > (match_operand:SI 2 > "nonmemory_operand" "U05,d,U05"))) >(clobber (reg:CC CC_REGNUM))] > "" > "\t%2, %1, %0" > ) > > From the h file: > > #define REG_CLASS_CONTENTS > \ > { > \ > {0x, 0x, 0x}, /* NO_REGS*/ \ > {0x, 0x, 0x}, /* D_REGS*/ \ > {0x, 0x, 0x}, /* R_REGS*/ \ > > ABI requires use of R registers for arguments and return value. Other > than that all of these instructions are more or less symmetrical in > sense of using D or R. 
So, an optimal choice would be use of R for > this example. And if D register is chosen, it involves additional copy > from R to D and back to R. > > Thank you, Frank Hi Frank, Have you defined REG_ALLOC_ORDER? All other things being equal, IRA will choose the register that comes first in the REG_ALLOC_ORDER. In terms of operand constraints, if you replace 'd' with '?d' in each of your operand constraints, it will show IRA that using D_REGS is slightly more expensive than using R_REGS. This may not always be true, however, if the value being put in the D_REG never needs to be returned, so this might cause IRA to use up all the R_REGS first with things that need not be there. I didn't like using constraints because of this issue - the extra cost sometimes incurred is conditional, and you can't represent that in the constraints. Have you defined REGISTER_MOVE_COST? In defaults.h, it is set to 2, which should be what you are looking for. If you can get IRA to see that the best COVER_CLASS is R_REGS then, when D_REGS is given, there will be an added cost of 2 for that option. One other thing: Don't overlook Pass 1. That's where the real costs are calculated. I hope some of this helps. If not, you'll have to send me your test program and ira dump file. Best regards, Ian
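On the REGISTER_MOVE_COST point above, a sketch of the shape such a cost function could take (the class names mirror Frank's port, but the values are illustrative; as noted, defaults.h simply defines the cost as 2):

```c
/* Illustrative only: make cross-class R<->D copies pricier than
   same-class moves, so the allocator sees that parking a value in
   D_REGS when it must end up in R_REGS implies an extra cost.  */
enum example_class { NO_REGS, D_REGS, R_REGS };

static int
example_register_move_cost (enum example_class from, enum example_class to)
{
  if (from == NO_REGS || to == NO_REGS)
    return 2;                   /* degenerate case, keep the default */
  return (from != to) ? 4 : 2;  /* cross-class copy needs a real move */
}
```

Unlike a '?' constraint, this penalty only bites when a cross-class copy is actually generated, which is why it avoids the unconditional-bias problem described above.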
RE: Coloring problem - Pass 0 for finding allocno costs
> -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of > Michael Matz > Sent: 18 March 2010 15:13 > To: Frank Isamov > Cc: gcc@gcc.gnu.org > Subject: Re: Coloring problem - Pass 0 for finding allocno costs > > Hi, > > On Thu, 18 Mar 2010, Frank Isamov wrote: > > > From the h file: > > > > #define REG_CLASS_CONTENTS > \ > > { > > \ > > {0x, 0x, 0x}, /* NO_REGS*/ \ > > {0x, 0x, 0x}, /* D_REGS*/ \ > > {0x, 0x, 0x}, /* R_REGS*/ \ > > > > ABI requires use of R registers for arguments and return value. Other > > than that all of these instructions are more or less symmetrical in > > sense of using D or R. > > If that is so, you're better off not using two different register > classes > at all. Register classes are there to describe assymetry in the ISA > register file. For describing ABI constraints like argument or > caller-save registers look at FUNCTION_ARG, FUNCTION_ARG_REGNO_P, > FUNCTION_VALUE_REGNO_P and CALL_USED_REGISTERS (and some friends). > > The compiler will e.g. try to make sure to first use call-clobbered > registers in leaf functions, so that they don't need to be > saved/restored > in the prologue/epilogue. You don't need register classes to describe > this. Good spot! But ... In the original post, Frank mentioned that all reg classes in an insn must match ("all registers in an insn should be from the same register class"), so I don't think a single class would suffice because there would be no way to enforce the above requirement. However, I think a superclass that contains both would work, and the CALL_USED_REGISTERS define can be used to show just the R_REGS. Meanwhile, the matching class requirement can still be enforced via the operand constraints. I might be wrong, however. I am still learning about IRA too, so feel free to teach me something new ;-)
Understanding Scheduling
Hi folks!

I've moved on from register allocation (see the Understanding IRA thread) and onto scheduling. In particular, I am investigating the effectiveness of the sched1 pass on our architecture and the associated interblock-scheduling optimisation.

Let's start with sched1 ...

For our architecture at least, it seems like Richard Earnshaw is right that sched1 is generally bad when you are using -Os, because it can increase register pressure and cause extra spill/fill code when you move independent instructions in between dependent instructions. For example:

  LOAD c2,c1[0]
  LOAD c3,c1[1]
  ADD  c2,c2,c3   # depends on LOAD above it (might stall)
  LOAD c3,c1[2]
  ADD  c2,c2,c3   # depends on LOAD above it (might stall)
  LOAD c3,c1[3]
  ADD  c2,c2,c3   # depends on LOAD above it (might stall)
  LOAD c3,c1[4]
  ADD  c2,c2,c3   # depends on LOAD above it (might stall)

might become:

  LOAD c2,c1[0]
  LOAD c3,c1[1]
  LOAD c4,c1[2]   # independent of first two LOADs
  LOAD c5,c1[3]   # independent of first two LOADs
  ADD  c2,c2,c3   # not dependent on preceding two insns (avoids stall)
  LOAD c3,c1[4]
  ADD  c2,c2,c4   # not dependent on preceding three insns (avoids stall)
  ...

This is a nice effect if your LOAD instructions have a latency of 3, so it should lead to performance increases, and indeed this is what I see for some low-reg-pressure Nullstone cases. Turning sched1 off therefore causes a regression on these cases.

However, this pipeline-type effect may increase your register pressure such that caller-save regs are required and extra spill/fill code needs to be generated. This happens for some other Nullstone cases, and so it is good to have sched1 turned off for them! It's therefore looking like some kind of clever hybrid is required.

I mention all this because I was wondering which other architectures have turned off sched1 for -Os?
More importantly, I was wondering if anyone else had considered creating some kind of clever hybrid that only uses sched1 when it will increase performance without increasing register pressure? Or perhaps I could make a heuristic based on the balanced-ness of the tree? (I see sched1 does a lot better if the tree is balanced, since it has more options to play with.)

Now onto interblock-scheduling ...

As we all know, you can't have interblock-scheduling enabled unless you use the sched1 pass, so if sched1 is off then interblock is irrelevant. For now, let's assume we are going to make some clever hybrid that allows sched1 when we think it will increase performance for -Os, and we are going to keep sched1 on for -O2 and -O3.

As I understand it, interblock-scheduling enlarges the scope of sched1, such that you can insert independent insns from a completely different block in between dependent insns in this block. As well as potentially amortising stalls on high-latency insns, we also get the chance to do "meatier" work in the destination block and leave less to do in the source block. I don't know if this is a deliberate effect of interblock-scheduling or just a happy side-effect.

Anyway, the reason I mention interblock-scheduling is that I see it doing seemingly intelligent moves, but then the later BB-reorder pass is juggling blocks around such that we end up with extra code inside hot loops! I assume this is because the scheduler and BB-reorderer are largely ignorant of each other, and so good intentions on the part of the former can be scuppered by the latter.

I was wondering if anyone else has witnessed this madness on their architecture? Maybe it is a bug with BB-reorder? Or maybe it should only be enabled when function profiling information (e.g. gcov) is available? Or maybe it is not a high-priority thing for anyone to think about because no one uses interblock-scheduling?

If anyone can shed some light on the above, I'd greatly appreciate it.
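The stall-avoidance effect of the LOAD/ADD example earlier in this mail can be quantified with a toy single-issue pipeline model. All the machine assumptions here are invented for illustration (LOAD latency 3, ADD latency 1, at most one dependence per insn, one insn issued per cycle); this is nothing like how haifa-sched models the machine, it just shows why the reordered sequence wins.

```c
/* Count stall cycles for a straight-line sequence.  dep[i] is the
   index of the insn whose result insn i consumes (-1 = none);
   latency[i] is the number of cycles until insn i's result is ready.  */
static int
count_stalls (const int dep[], const int latency[], int n)
{
  int ready[64];                /* cycle at which each result is ready */
  int cycle = 0, stalls = 0;

  for (int i = 0; i < n && i < 64; i++)
    {
      if (dep[i] >= 0 && ready[dep[i]] > cycle)
        {
          stalls += ready[dep[i]] - cycle;  /* wait for the operand */
          cycle = ready[dep[i]];
        }
      ready[i] = cycle + latency[i];
      cycle++;                  /* single issue: one insn per cycle */
    }
  return stalls;
}
```

Feeding in the dependent-chain shape versus the scheduled shape shows the former paying a stall before every ADD while the latter pays none.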
For now, I will continue my experiments with selective enabling of sched1 for -Os. Best regards, Ian
RE: Understanding Scheduling
> > Let's start with sched1 ... > > > > For our architecture at least, it seems like Richard Earnshaw is > > right that sched1 is generally bad when you are using -Os, because > > it can increase register pressure and cause extra spill/fill code > when > > you move independent instructions in between dependent instructions. > > Please note that Vladimir Makarov implemented register pressure-aware > sched1 > for GCC 4.5, activated with -fsched-pressure. I thought I should > mention > this because your e-mail omits it completely, so it's hard to tell > whether you > tested it. > Hi Alexander, We are on GCC 4.4 at the moment. I don't see us moving up in the near future, but I could apply the patch and see what it does for us. Cheers, Ian
RE: Understanding Scheduling
> > I mention all this because I was wondering which other architectures > > have turned off sched1 for -Os? More importantly, I was wondering > > if anyone else had considered creating some kind of clever hybrid > > that only uses sched1 when it will increase performance without > > increasing register pressure? > > > > > http://gcc.gnu.org/ml/gcc-patches/2009-09/msg3.html > > Another problem is that sched1 for architectures with few registers can > result in reload failure. I tried to fix this in the patch mentioned > above but I am not sure it is done for all targets and all possible > programs. The right solution for this would be implementing hard > register spills in the reload. I don't think we have so few registers that reload failure will occur, so it might be worth me trying this. > > The mentioned above code does not work for RA based on priority > coloring > because register pressure calculation for intersected or nested classes > has a little sense. Hmm. Thanks for mentioning that. As you might recall, we are using priority coloring at the moment because it yielded better performance than Chaitin-Briggs. Well, the real reason CB was rejected was that we were already using Priority and so moving to CB would cause sufficient disruption such that performance increases and decreases would be inevitable. I did get some gains, but I also got regressions that we couldn't absorb at the time I did the work. I might be revisiting CB soon though, as it did tend to yield smaller code, which is becoming more important to us. > > If scheduling for the target is very important (as for itanium or > in-order execution power6), I'd recommend to look at the selective > scheduler. I don't think scheduling is highly important for us, but I will take a look at the selective scheduler. > > > Or perhaps I could make a heuristic based on the balanced-ness of the > > tree? (I see sched1 does a lot better if the tree is balanced, since > > it has more options to play with.) 
> > > > > > > The register pressure is already mostly minimized when shed1 starts to > work. I guess this is a factor in the unbalancedness of the tree. The more you balance it, the more likely it will get wider and require more registers. But the wider and more balanced, the more options for sched1 and the more chance of a performance win (assuming the increase in reg pressure does not outweigh the scheduling performance win.) > > Now onto interblock-scheduling ... > > > > As we all know, you can't have interblock-scheduling enabled unless > > you use the sched1 pass, so if sched1 is off then interblock is > > irrelevant. For now, let's assume we are going to make some clever > > hybrid that allows sched1 when we think it will increase performance > > for Os and we are going to keep sched1 on for O2 and O3. > > > > As I understand it, interblock-scheduling enlarges the scope of > > sched1, such that you can insert independent insns from a > > completely different block in between dependent insns in this > > block. As well as potentially amortizing stalls on high latency > > insns, we also get the chance to do "meatier" work in the destination > > block and leave less to do in the source block. I don't know if this > > is a deliberate effect of interblock-scheduling or if it is just > > a happy side-effect. > > > > Anyway, the reason I mention interblock-scheduling is that I see it > > doing seemingly intelligent moves, but then the later BB-reorder pass > > is juggling blocks around such that we end up with extra code inside > > hot loops! I assume this is because the scheduler and BB-reorderer > > are largely ignorant of each other, and so good intentions on the > > part of the former can be scuppered by the latter. > > > > > That is right. It would be nice if somebody solves the problem. Hmm. If we keep sched1 on, then maybe I will be the man to do it! Best regards, Ian
RE: Understanding Scheduling
> Enabling BB-reorder only if profile info is available is not the
> right way to go.  The compiler really doesn't place blocks in sane
> places without it -- and it shouldn't have to, either.  For example,
> if you split an edge at some point, the last thing you want to worry
> about is where the new basic block is going to end up.
>
> There are actually a few bugs in Bugzilla about BB-reorder, FWIW.

I've done a few searches in Bugzilla and am not sure if I have found the BB reorder bugs you are referring to.  The ones I have found are:

16797: Opportunity to remove unnecessary load instructions
41396: missed space optimization related to basic block reorder
21002: RTL prologue and basic-block reordering pessimizes delay-slot filling.

(If you can recall any others, I'd appreciate hearing of them.)

Based on 41396, it looks like BB reorder is disabled for -Os.  But you said in your post above that "the compiler really doesn't place blocks in sane places without it", so does that mean we could probably increase performance for -Os if BB reorder were improved and enabled?

Cheers,
Ian
BB reorder forced off for -Os
Is there any reason why BB reorder has been disabled in bb-reorder.c for -Os, such that you can't even turn it on with -freorder-blocks?

From what I've heard on this list in recent days, BB reorder gives good performance wins, such that most people would still want it on even if it did increase code size a little.

Cheers,
Ian
RE: BB reorder forced off for -Os
> We're not able to enable BB reordering with -Os.  The behaviour is
> hard-coded via this if statement in rest_of_handle_reorder_blocks():
>
>   if ((flag_reorder_blocks || flag_reorder_blocks_and_partition)
>       /* Don't reorder blocks when optimizing for size because extra
>          jump insns may be created; also barrier may create extra padding.
>
>          More correctly we should have a block reordering mode that tried to
>          minimize the combined size of all the jumps.  This would more or
>          less automatically remove extra jumps, but would also try to use
>          more short jumps instead of long jumps.  */
>       && optimize_function_for_speed_p (cfun))
>     {
>       reorder_basic_blocks ();
>
> If you comment out the "&& optimize_function_for_speed_p (cfun)" then
> BB reordering takes place as desired (although this isn't a solution,
> obviously).
>
> In a private message Ian indicated that this had a small impact for the
> ISA he's working with but a significant performance gain.  I tried the
> same thing with the ISA I work on (Ubicom32) and this change typically
> increased code sizes by between 0.1% and 0.3% but improved performance
> by anything from 0.8% to 3%, so on balance this is definitely winning
> for most of our users (this for a couple of benchmarks, the Linux
> kernel, busybox and smbd).

It should be noted that commenting out the conditional to do with optimising for speed will make BB reordering come on for all functions, even cold ones, so I think whatever gains have come from making this hacky change could increase further if BB reordering is set to only come on for hot functions when compiling with -Os.  (Certainly the code size increases could be minimised, whilst hopefully retaining the performance gains.)

Note that I am in no way suggesting this should be the default behaviour for -Os, but that it should be switchable via the flags, just like other optimisations are.
But once it is switchable, I expect choosing to turn it on for -Os should not cause universal enabling of BB reordering for every function (as opposed to the current universal disabling for every function), but rather a sensible half-way point, based on heat, so that you get the performance wins on selected functions with minimal code size increases.

Cheers,
Ian
Making BB Reorder Smarter for -Os
Bugzilla 41004 calls for a more -Os-friendly algorithm for BB Reorder, and I'm hoping I can meet this challenge.  But I need your help!  If you have any existing ideas or thoughts that could help me get closer to a sensible heuristic sooner, then please post them to this list.

In the meantime, here's my progress so far:

My first thought was to limit BB Reorder to hot functions, as identified by gcov, so we could get maximum execution-time benefit for minimised code size impact.  Based on benchmarking I've done so far, this looks to be a winner, but it only works when profile data is available.

Without profile data, we face two main problems:

1) We cannot easily estimate how many times we will execute our more efficient code layout, so we can't do an accurate trade-off versus the code size increase.

2) The traces that BB Reorder constructs will be based on static branch prediction, rather than true dynamic flows of execution, so the new layout may not be the best one in practice.

We can address #1 by tagging functions as hot (using attributes), but that may not always be possible, and it does not guarantee that we will get minimal code size increases, which is the main aim of this work.

I'm not sure how #2 can be addressed, so I'm planning to sidestep it completely, since the problem isn't really the performance pay-off but the code size increase that usually comes with each new layout of a function that BB Reorder makes.

My current plan is to characterise a function within find_traces() (looking at things like the number of traces, edge probabilities and frequencies, etc.) and only call connect_traces() to effect the reordering change if these characteristics suggest that minimal code disruption and/or maximum performance pay-off will occur.

Thanks for reading and I look forward to your input!

Cheers,
Ian
Question about tree-switch-conversion.c
I am in the process of fixing PR44328 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44328).

The problem is that gen_inbound_check in tree-switch-conversion.c subtracts info.range_min from info.index_expr, which can cause the MIN and MAX values for info.index_expr to become invalid.  For example:

typedef enum
{
  FIRST = 0,
  SECOND,
  THIRD,
  FOURTH
} ExampleEnum;

int dummy (const ExampleEnum e)
{
  int mode = 0;
  switch (e)
    {
    case SECOND: mode = 20; break;
    case THIRD:  mode = 30; break;
    case FOURTH: mode = 40; break;
    }
  return mode;
}

tree-switch-conversion would like to create a lookup table for this, so that SECOND maps to entry 0, THIRD maps to entry 1 and FOURTH maps to entry 2.  It achieves this by subtracting SECOND from index_expr.  The problem is that, after the subtraction, the result can hold a value outside the 0-3 range of its type.  Later, when tree-vrp.c sees the inbound check as being <= 2 with a possible range for the type of 0-3, it converts the <= 2 into a != 3, which is totally wrong.  If e == FIRST, then we can end up looking for entry 255 in the lookup table!

I think the solution is to update the type of the result of the subtraction to show that it is no longer in the range 0-3, but I have had trouble implementing this.  The attached patch (based off the 4.5 branch) shows my current approach, but I ran into LTO issues:

lto1: internal compiler error: in lto_get_pickled_tree, at lto-streamer-in.c

I am guessing this is because the debug info for the type does not match the new range I have set for it.  Is there a *right* way to update the range such that LTO doesn't get unhappy?  (Maybe a cast with fold_convert_loc would be right?)

pr44328.gcc4.5.fix.patch
Description: Binary data
RE: Link error
Phung Nguyen wrote:
> I am trying to build a cross compiler for xc16x.  I built binutils
> 2.18, gcc 4.0 and newlib 1.18 successfully.  Everything is fine when
> compiling a simple C file without any library call.  It is also fine
> when making a simple call to printf like printf("Hello world").
> However, I got an error message from the linker when calling
> printf("i=%i",i);

I don't know the answer, but I think you are more likely to get one if you post to gcc-h...@gcc.gnu.org.  The gcc@gcc.gnu.org list is for people developing gcc, rather than only building or using it.

I hope you find your answer soon.

Best regards,
Ian
Shouldn't unsafe-math-optimizations (re-)enable fp-contract=fast?
Hi there,

I see in common.opt that fp-contract=fast is the default for GCC.  But then it gets disabled in c-family/c-opts.c if you are using ISO C (e.g. with -std=c99).

But surely, if you have also specified -funsafe-math-optimizations, it should be flipped back to fast?  I see in gcc/opts.c that a few other flags are triggered based on -funsafe-math-optimizations, but not -ffp-contract=fast.

I think this is a bug, and it could lead to less optimal code than people would like if they want C99 compliance for reason A but also unsafe math for reason B.

What do you think?

Cheers,
Ian
RE: Shouldn't unsafe-math-optimizations (re-)enable fp-contract=fast?
> -----Original Message-----
> From: Joseph S. Myers
> On Thu, 6 Mar 2014, Ian Bolton wrote:
>
> > I see in common.opt that fp-contract=fast is the default for GCC.
> >
> > But then it gets disabled in c-family/c-opts.c if you are using ISO C
> > (e.g. with -std=c99).
> >
> > But surely if you have also specified -funsafe-math-optimizations
> > then it should flip it back onto fast?
>
> That seems reasonable.

Thanks for the feedback, Joseph.

I've been working on the patch, but I see in gcc/opts.c that there is a definite distinction between set_fast_math_flags and set_unsafe_math_optimizations_flags.  I'm thinking this is more of a fast-math thing than an unsafe-math-optimizations thing, so I should actually be adding it in set_fast_math_flags.

Is this correct?

Whilst I'm on the subject, I think I found a bug in the Optimize-Options documentation ... It says that -ffast-math is enabled by -Ofast and that -ffast-math enables -funsafe-math-optimizations.  But the definition of -funsafe-math-optimizations says it is not turned on by any -O option.  I think it should say that it is turned on by -Ofast (via -ffast-math), just for clarity and consistency.

Does that make sense?

Many thanks,
Ian

> --
> Joseph S. Myers
> jos...@codesourcery.com
RE: Shouldn't unsafe-math-optimizations (re-)enable fp-contract=fast?
> Thanks for the feedback, Joseph.
>
> I've been working on the patch, but I see in gcc/opts.c that there is
> a definite distinction between set_fast_math_flags and
> set_unsafe_math_optimizations_flags.  I'm thinking this is more of a
> fast-math thing than an unsafe_math_optimizations thing, so I should
> actually be adding it in set_fast_math_flags.
>
> Is this correct?

Scratch that.  I went ahead and tried to set it in opts.c, but the c-opts.c override (to OFF) will currently happen in every case UNLESS you explicitly asked for -ffp-contract=fast.

I therefore think the better solution is now to not override it in c-opts.c if unsafe-math-optimizations (which itself is enabled by -ffast-math and/or -Ofast) is enabled.  We already have fp-contract=fast on by default, so we should just stop it being turned off unnecessarily instead.

I've successfully coded this one and it works as expected when compiled with "-Ofast -std=c99".  I'll prepare the patch now.

Cheers,
Ian