GCC driver to "Compile twice, score the assembly, choose the best"?

2014-05-15 Thread Ian Bolton
Hi, fellow GCC developers!

I was wondering if the "gcc" driver could be made to invoke
"cc1" twice, with different flags, and then just keep the
better of the two .s files that comes out?

I'm sure this is not a new idea, but I'm not aware of
anything being done in this area, so I've made this post to
gather your views. :)

The kinds of flags I am thinking could be toggled are
register allocation and instruction scheduling ones, since
it's very hard to find one-size-fits-all there and we don't
really want to have the user depend on knowing the right
one.

Obviously, compilation time will go up, but the run-time
benefits could be huge.
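For illustration, the driver-level idea could be prototyped as a wrapper script. The scoring heuristic below (a crude count of instruction lines) is purely a placeholder assumption on my part, since deciding which .s file is "better" is exactly the open problem:

```shell
# score_asm FILE: crude proxy metric -- count lines that are neither
# assembler directives (start with '.') nor labels (end with ':').
# A real scorer would need to weight instructions, model scheduling, etc.
score_asm() {
    grep -c -v -e '^[[:space:]]*\.' -e ':$' "$1"
}

# pick_better A B: print whichever .s file scores better (lower count).
# The driver would keep the winner and discard the loser.
pick_better() {
    if [ "$(score_asm "$1")" -le "$(score_asm "$2")" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}
```

The driver would run cc1 twice (e.g. with scheduling on and off), score both outputs, and keep the winner.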

What are your thoughts?  What work in this area have I
failed to dig up in my limited research?

Many thanks,
Ian






RE: GCC driver to "Compile twice, score the assembly, choose the best"?

2014-05-15 Thread Ian Bolton
Thanks for the quick response.

> On Thu, May 15, 2014 at 1:46 PM, Ian Bolton  wrote:
> > Hi, fellow GCC developers!
> >
> > I was wondering if the "gcc" driver could be made to invoke
> > "cc1" twice, with different flags, and then just keep the
> > better of the two .s files that comes out?
> 
> I'd be interested in your .s comparison tool that decides which one
> is better!
> 

Well, yes, that's the hard part and it could be done a number of ways.

I think it would make a good contest actually, or summer coding
project.  Or at least a nice subject for brainstorming at the GNU
Cauldron. :)

Cheers,
Ian





RE: Using particular register class (like floating point registers) as spill register class

2014-05-16 Thread Ian Bolton
> On 05/16/2014 12:05 PM, Kugan wrote:
> >
> >
> > On 16/05/14 20:40, pins...@gmail.com wrote:
> >>
> >>
> >>> On May 16, 2014, at 3:23 AM, Kugan
>  wrote:
> >>>
> >>> I would like to know if there is any way we can use registers from
> >>> a particular register class just as spill registers (in places
> >>> where the register allocator would normally spill to the stack and
> >>> nothing more), when it can be useful.
> >>>
> >>> In AArch64, in some cases, compiling with -mgeneral-regs-only
> >>> produces better performance compared to not using it.  The
> >>> difference here is that when -mgeneral-regs-only is not used,
> >>> floating point registers are also used in register allocation.
> >>> Then IRA/LRA has to move them to core registers before performing
> >>> operations as shown below.
> >>
> >> Can you show the code with fp registers disabled?  Does it use the
> >> stack to spill?  Normally this is due to register-to-register-class
> >> costs compared to the register-to-memory move cost.  Also I think
> >> it depends on the processor rather than the target.  For Thunder,
> >> using the fp registers might actually be better than using the
> >> stack, depending on whether the stack was in L1.
> > Not all the LDR/STR combinations match to fmov.  In the testcase I
> > have:
> >
> > aarch64-none-linux-gnu-gcc sha_dgst.c -O2  -S  -mgeneral-regs-only
> > grep -c "ldr" sha_dgst.s
> > 50
> > grep -c "str" sha_dgst.s
> > 42
> > grep -c "fmov" sha_dgst.s
> > 0
> >
> > aarch64-none-linux-gnu-gcc sha_dgst.c -O2  -S
> > grep -c "ldr" sha_dgst.s
> > 42
> > grep -c "str" sha_dgst.s
> > 31
> > grep -c "fmov" sha_dgst.s
> > 105
> >
> > I am not saying that we shouldn't use floating point registers here.
> > But from the above, it seems like the register allocator is using
> > them more like core registers (even though the cost model has a
> > higher cost) and then moving the values to core registers before
> > operations.  If that is the case, my question is: how do we make
> > this just a spill register class, so that we replace ldr/str with an
> > equal number of fmov instructions when possible?
> 
> I'm also seeing stuff like this:
> 
> => 0x7fb72a0928  Thread*)+2500>:
> add   x21, x4, x21, lsl #3
> => 0x7fb72a092c  Thread*)+2504>:
> fmov  w2, s8
> => 0x7fb72a0930  Thread*)+2508>:
> str   w2, [x21,#88]
> 
> I guess GCC doesn't know how to store an SImode value in an FP
> register into memory?  This is 4.8.1.
> 

Please can you try that on trunk and report back.

Thanks,
Ian
 





RE: soft-fp functions support without using libgcc

2014-05-16 Thread Ian Bolton
> On Fri, May 16, 2014 at 6:34 AM, Sheheryar Zahoor Qazi
>  wrote:
> >
> > I am trying to provide soft-fp support to an 18-bit soft-core
> > processor architecture at my university.  But the problem is that
> > libgcc has not been cross-compiled for my target architecture and
> > some functions are missing, so I cannot build libgcc.  I believe
> > soft-fp is compiled in libgcc, so I am unable to invoke soft-fp
> > functions from libgcc.
> > Is it possible for me to provide soft-fp support without using
> > libgcc?  How should I proceed in defining the functions?  Any idea?
> > And does any architecture provide floating point support without
> > using libgcc?
> 
> I'm sorry, I don't understand the premise of your question.  It is not
> necessary to build libgcc before building libgcc.  That would not make
> sense.  If you have a working compiler that is missing some functions
> provided by libgcc, that should be sufficient to build libgcc.

If you replace "cross-compiled" with "ported", I think it makes sense.
Can one provide soft-fp support without porting libgcc for their
architecture?
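For what it's worth, the functions in question are the libgcc soft-float entry points that the compiler emits calls to when there is no FPU (documented in the GCC internals manual under "Soft float library routines"); a port that avoids libgcc would have to export the same symbols itself.  A toy sketch, which cheats by leaning on host arithmetic rather than doing real bit-level soft-fp:

```c
#include <assert.h>

/* On a soft-float target the compiler lowers float arithmetic to calls
   into libgcc: 'a + b' becomes __addsf3, 'a * b' becomes __mulsf3, and
   so on.  A port that does not build libgcc must export these same
   symbols from somewhere.  The bodies below cheat by using the host's
   own float support; a real soft-fp routine works on the IEEE bit
   patterns (see libgcc/soft-fp for GCC's generic implementation).  */
float __addsf3 (float a, float b) { return a + b; }
float __mulsf3 (float a, float b) { return a * b; }
```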

Cheers,
Ian





RE: Using particular register class (like floating point registers) as spill register class

2014-05-19 Thread Ian Bolton
> >
> > Please can you try that on trunk and report back.
> 
> OK, this is trunk, and I'm no longer seeing that happen.
> 
> However, I am seeing:
> 
>0x007fb76dc82c <+160>: adrp x25, 0x7fb7c8
>0x007fb76dc830 <+164>: add  x25, x25, #0x480
>0x007fb76dc834 <+168>: fmov d8, x0
>0x007fb76dc838 <+172>: add  x0, x29, #0x160
>0x007fb76dc83c <+176>: fmov d9, x0
>0x007fb76dc840 <+180>: add  x0, x29, #0xd8
>0x007fb76dc844 <+184>: fmov d10, x0
>0x007fb76dc848 <+188>: add  x0, x29, #0xf8
>0x007fb76dc84c <+192>: fmov d11, x0
> 
> followed later by:
> 
>0x007fb76dd224 <+2712>: fmov x0, d9
>0x007fb76dd228 <+2716>: add  x6, x29, #0x118
>0x007fb76dd22c <+2720>: str  x20, [x0,w27,sxtw #3]
>0x007fb76dd230 <+2724>: fmov x0, d10
>0x007fb76dd234 <+2728>: str  w28, [x0,w27,sxtw #2]
>0x007fb76dd238 <+2732>: fmov x0, d11
>0x007fb76dd23c <+2736>: str  w19, [x0,w27,sxtw #2]
> 
> which seems a bit suboptimal, given that these double registers now
> have
> to be saved in the prologue.
> 

Thanks for doing that.  Many AArch64 improvements have gone in since
4.8 was released.

I think we'd have to see the output for the whole function to
determine whether that code is sane. I don't suppose the source
code is shareable or you have a testcase for this you can share?

Cheers,
Ian





Using -save-temps and @file should also save the intermediate @file used by the driver?

2011-05-10 Thread Ian Bolton
Does anyone have some thoughts they'd like to share on this:

"When you compile anything using @file support, the driver assumes @file
(at_file_supplied is true) is allowed and may pass options to the linker via
@file using a *temporary* file.

When -save-temps is also used, the temporary @file passed to the linker
should also be saved.

Saving the temporary @file passed to the linker allows a developer to re-run
just the collect2/ld command.

On trunk this means that gcc/gcc.c (create_at_file) should honour the
save_temps_flag, saving the temporary @file for later analysis or use."

From: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44273
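For readers unfamiliar with @file: it is simply a response file of driver options, expanded in place on the command line (the file name below is made up for illustration):

```shell
# Create a response file of options.  Invoking
#   gcc @opts.rsp -c foo.c -save-temps
# would make the driver read '-O2 -Wall' from the file.  The bug report
# above asks that the *driver-generated* temporary @file handed to the
# linker survive -save-temps in the same way a user-written one does.
printf '%s\n' '-O2 -Wall' > opts.rsp
cat opts.rsp
```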







Understanding IRA

2009-10-16 Thread Ian Bolton
Hi Vladimir,

I have just begun working for Icera Inc, on a private port of GCC,
and my first task has been to investigate the performance impact of
using the Chaitin-Briggs algorithm of IRA vs the priority algorithm
that we currently have enabled in our production compiler (due to not
defining IRA_COVER_CLASSES).

After running various tests, I measured little performance change
except for some test cases that regressed quite considerably.
My suspicion was that these regressions were due to our architecture
having some instructions that can only use a subset of the register
bank, so I set about trying to understand IRA enough to confirm this.

I believe I have now reached a point where I understand IRA
enough to start thinking about how I can improve these regressed
cases, but I thought it might be a good idea to formulate an email
with my current understanding of IRA in it, so you could correct any
errors before I continue.  (I figure this exchange between us will
also serve useful to other newbies in the field of GCC register
allocation!)

Without further ado, here is my current understanding of IRA
(numbered for ease of reference).

1. You can enable full IRA, with Chaitin-Briggs coloring, by defining
IRA_COVER_CLASSES, but you can just as easily disable it again by
supplying -fira-algorithm=priority at the command line.  (The latter
is useful to know because it means you can compare two versions
without recompiling.)

2. The priority algorithm allocates in an order determined by the
allocno_priority_compare_func and calls into assign_hard_reg in that
order to determine which allocno gets what register (or memory).

3. The CB algorithm puts allocnos into colored and uncolored buckets,
according to conflicts between allocnos.  The more conflicts, the
more chance of being put in the uncolored bucket.  Being in the
uncolored bucket increases an allocno's chance of being spilled;
being in the colored bucket guarantees that it will get a hard
register.

4. When you run at -O0, the conflicts aspect of IRA is disabled
(ira_conflicts_p is set to false), and this subsequently causes the
fast_allocation function to be used instead of the do_coloring IRA
function.  (This makes sense since you cannot separate into buckets
without inspecting conflicts.)

5. Assuming ira_conflicts_p is enabled and you are using the CB
algorithm, the two buckets are sorted using a
bucket_allocno_compare_func, and this ordering is important, since it
dictates the order each allocno will be pushed onto the stack and
hence the order it will be popped from it (and assign_hard_reg
called).

6. Allocnos from the colorable bucket always go on the stack first
(in the order they were sorted) and then uncolorable ones, in an
order based on which is the best candidate for spilling.  Each time
an allocno is added to the stack, conflicts may be reduced and there
is the possibility of moving one or more allocnos from the
uncolorable bucket to the colorable one (so they end up being added
to the stack sooner than the other uncolorable ones).  Near the end
of processing the uncolorable bucket, there comes an opportunity to
move all remaining allocnos to the colorable bucket (due to few
remaining conflicts) and these get put on the stack last, and hence
popped first, and so they all get registers. These final uncolorables
will be those that had the highest spill cost and/or the most
remaining conflicts.

7. The spill cost and priority calculation for determining which
uncolorable to put on the stack first is important, for two main
reasons: firstly, whichever allocno is picked is a more likely
candidate for spilling; secondly, it will free other allocnos up to
be added to the nicer colored bucket, from which an allocno always
receives a hard register.
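To make points 5-6 concrete, here is a toy model (my own illustrative structures, not GCC's) of the push order: colorable allocnos are pushed onto the stack first, then uncolorable ones with the best spill candidates first, so the costliest uncolorable allocnos are popped, and handed to register assignment, first:

```c
#include <assert.h>
#include <stdlib.h>

struct allocno
{
  int id;
  int colorable;       /* which bucket the allocno sits in */
  int spill_priority;  /* higher = worse spill candidate */
};

/* Push order: colorable allocnos first, then uncolorable ones with the
   best (cheapest-to-spill) candidates first.  */
static int
push_order_cmp (const void *pa, const void *pb)
{
  const struct allocno *a = pa, *b = pb;
  if (a->colorable != b->colorable)
    return b->colorable - a->colorable;
  return a->spill_priority - b->spill_priority;
}

/* Sort into push order and return the id popped first (top of stack,
   i.e. the last allocno pushed), which therefore gets first pick of
   the hard registers.  */
static int
first_popped (struct allocno *v, size_t n)
{
  qsort (v, n, sizeof *v, push_order_cmp);
  return v[n - 1].id;
}
```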

If you could let me know if and where I've gone wrong with any of the
above, I would be extremely grateful.  Then, once we are on the same
page, I hope you could also make some suggestions as to how I might
help IRA work well with our instructions that can only use a subset
of the register bank.

Best regards, Ian Bolton (Compiler Engineer, Icera Inc.)


RE: Understanding IRA

2009-10-19 Thread Ian Bolton
Hi Jeff and Vladimir.

Jeff: I'd be interested in trying the patch if you can send it my way.

Vladimir: Today I have seen reasonable improvements on
progressively-trimmed-down versions of a *single* test case by applying
some cost adjustments in the case when an operand wants one of our
bottom registers and IRA is considering giving it an alternative
register or memory.  I will be experimenting with different values for
these two new costs and some more test cases tomorrow.

I wonder if the optimal register allocator will just end up being one
that tries all the algorithms and picks the best output for a given
case?  When it comes to optimisation, I prefer to bend the rules if at
all possible!

Best regards,
Ian

-Original Message-
From: Jeff Law [mailto:l...@redhat.com] 
Sent: 16 October 2009 16:45
To: Vladimir Makarov
Cc: Ian Bolton; gcc@gcc.gnu.org
Subject: Re: Understanding IRA

On 10/16/09 08:53, Vladimir Makarov wrote:
>
> The biggest problem of GCC RA is still reload pass.  It frequently 
> changes decisions of IRA not in a good way.
Agreed.  Not only may reload make a bad choice, it's horribly 
unpredictable.  Trivial changes often lead to drastically different 
reloading decisions which in turn drastically change the final output.

One of my favorites right now is the round-robin selection of spill
registers to encourage reload inheritance.  While I certainly
understand the reasoning behind the code, it's amazingly frustrating
to watch reload choose the worst possible spill register simply
because of the round-robin selection.

I've got a little hack in the reload-v2 branch which queries IRA to 
mitigate this problem, but it's merely a short-term hack until I can 
make reload inheritance go away.

jeff




RE: Understanding IRA

2009-11-03 Thread Ian Bolton
Hi again, Vladimir,

I am pleased to report some performance improvements after altering
ira-costs.c.  A key benchmark for us has improved by 5%.

Specifically, in record_reg_classes(), after the alt_cost has been
calculated and it will be applied to pp->mem_cost and pp->cost[k], I
check whether this particular operand wanted one of our BOTTOM_REGS
(r0-r15) and I further increase the pp->mem_cost by an arbitrary
amount and also increase pp->cost[k] by an arbitrary amount if k
does not represent the BOTTOM_REGS class.  My aim here is to nudge
IRA in the right direction for operands that just want BOTTOM_REGS.

After experimenting with different values for my "arbitrary
amounts", I discovered some that successfully made IRA more likely
to give BOTTOM_REGS to those instructions/operands that want
BOTTOM_REGS, since any other regs and memory ended up with high
enough costs for IRA to try and avoid using them.

I have included a snippet from my version of record_reg_classes()
below:


op_cost_add = alt_cost * frequency;
/* Finally, update the costs with the information we've
   calculated about this alternative.  */
for (i = 0; i < n_ops; i++)
  if (REG_P (ops[i]) && REGNO (ops[i]) >= FIRST_PSEUDO_REGISTER)
    {
      struct costs *pp = op_costs[i], *qq = this_op_costs[i];
      int scale = 1 + (recog_data.operand_type[i] == OP_INOUT);

      /* If this operand really wanted a BOTTOM_REG, add an extra
         cost onto memory to nudge IRA away from putting it in
         memory.  */
      if (allocno_pref
          && (allocno_pref[ALLOCNO_NUM (ira_curr_regno_allocno_map
                                        [REGNO (ops[i])])]
              == BOTTOM_REGS))
        pp->mem_cost
          = MIN (pp->mem_cost,
                 (qq->mem_cost + op_cost_add
                  + (flag_ira_preferred_register_cost_memory * frequency))
                 * scale);
      else
        pp->mem_cost = MIN (pp->mem_cost,
                            (qq->mem_cost + op_cost_add) * scale);

      for (k = 0; k < cost_classes_num; k++)
        {
          /* If this operand really wanted a BOTTOM_REG, add an
             extra cost onto any register class that isn't
             BOTTOM_REGS to nudge IRA away from putting it in a
             hard register of that class.  */
          if (allocno_pref
              && (allocno_pref[ALLOCNO_NUM (ira_curr_regno_allocno_map
                                            [REGNO (ops[i])])]
                  == BOTTOM_REGS))
            switch (cost_classes[k])
              {
              case BOTTOM_REGS:
                op_cost_add = alt_cost * frequency;
                break;
              case TOP_CREGS:
              case C_REGS:
                op_cost_add = (alt_cost
                               + flag_ira_preferred_register_cost_register)
                              * frequency;
                break;
              default:
                op_cost_add = alt_cost * frequency;
                break;
              }

          pp->cost[k] = MIN (pp->cost[k],
                             (qq->cost[k] + op_cost_add) * scale);
        }
    }


So far, I have found the best value for
flag_ira_preferred_register_cost_memory to be 20 and the best value
for flag_ira_preferred_register_cost_register to be 6.  I appreciate
that these numbers do not really correlate with the other cost units,
but they were the ones that made the impact.

In terms of coloring algorithms, we are still seeing better
performance with the priority algorithm on our benchmarks, but the
cost adjustments above improved both priority algorithm and the CB
algorithm, with ira-region=mixed and ira-region=one.

If you have any thoughts you'd like to share then I'd definitely be
interested, but this post is mainly because you said in a previous
email that you wanted to hear my suggestions :)

Best regards,
Ian


>Ian Bolton wrote:
>>  I hope you could also make some suggestions as to how I might
>>  help IRA work well with our instructions that can only use a
>>  subset of the register bank.
>>   
>I forgot to write: thanks,  it would be interesting for me to see
>your suggestions :)


RE: Understanding IRA

2009-11-04 Thread Ian Bolton
Hi Jeff,

From an empirical perspective, the value of your patch is hard to
determine at this stage - one benchmark improved about 0.5% but others
have marginally regressed.

From an intellectual perspective, however, your patch is clearly a Good
Thing.  If I am understanding correctly, your aim is to prevent cases
where reload evicts a pseudo from a register that conflicts with the
pseudo that we are trying to reload, thereby undoing the clever cost-
based logic in IRA that gave the register to the current owner in the
first place?

My guess is that the performance drop could be attributed to reload
being lucky in some cases and your patch is preventing this luck from
happening.  Whilst I'm more of a risk-taker by nature, my employer
would prefer more predictable behaviour from its compiler, so we will
likely commit your patch to our development branch. Thanks very much!

Is there an ETA on when reload will go away? ;-)

Best regards,
Ian
 
> -Original Message-
> From: Jeff Law [mailto:l...@redhat.com]
> Sent: 22 October 2009 23:05
> To: Ian Bolton
> Cc: Vladimir Makarov; gcc@gcc.gnu.org
> Subject: Re: Understanding IRA
> 
> On 10/19/09 12:30, Ian Bolton wrote:
> > Hi Jeff and Vladimir.
> >
> > Jeff: I'd be interested in trying the patch if you can send it my
> way.
> >
> It's nothing special.
> 
> /* Return nonzero if REGNO is a particularly bad choice for reloading
> X.  */
> static int
> ira_bad_reload_regno_1 (int regno, rtx x)
> {
>int x_regno;
>ira_allocno_t a;
>enum reg_class pref;
> 
>/* We only deal with pseudo regs.  */
>if (! x || GET_CODE (x) != REG)
>  return 0;
> 
>x_regno = REGNO (x);
>if (x_regno < FIRST_PSEUDO_REGISTER)
>  return 0;
> 
>/* If the pseudo prefers REGNO explicitly, then do not consider
>   REGNO a bad spill choice.  */
>pref = reg_preferred_class (x_regno);
>if (reg_class_size[pref] == 1
> && TEST_HARD_REG_BIT (reg_class_contents[pref], regno))
>  return 0;
> 
>/* If the pseudo conflicts with REGNO, then we consider REGNO a
>   poor choice for a reload regno.  */
>a = ira_regno_allocno_map[x_regno];
>if (TEST_HARD_REG_BIT (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a), regno))
>  return 1;
> 
>return 0;
> }
> 
> /* Return nonzero if REGNO is a particularly bad choice for reloading
> IN or OUT.  */
> int
> ira_bad_reload_regno (int regno, rtx in, rtx out)
> {
>return (ira_bad_reload_regno_1 (regno, in)
>|| ira_bad_reload_regno_1 (regno, out));
> }
> 
> Then change the loop in allocate_reload_reg to iterate 3 times instead
> of 2.  And add this fragment
> 
> 
> 
>   if (pass == 1
> && ira_bad_reload_regno (regnum, rld[r].in, rld[r].out))
>  continue;
> 
> 
> To the body of the conditional starting with
> 
> if ((reload_reg_free_p ...
> 
> 
> It's really just a hack.  I don't want to spend much time on that
> code as ultimately I want it all to go away.
> 
> Jeff
> 
> 
> 



RE: Understanding IRA

2009-11-09 Thread Ian Bolton
Dave Hudson wrote:
> On Thu, 2009-11-05 at 18:05 +0000, Ian Bolton wrote:
> > I think I may have made a breakthrough!
> >
> > As mentioned above, IRA is correctly increasing the cost for TOP_REGS
> > when an allocno in region 1 is being used in one of our special
> > instructions that needs BOTTOM_REGS.  We can also see that IRA is
> > passing that allocno cost up to the associated allocno in region 0
> > and altering the total_cost for that allocno.
> >
> > However, it does not appear as though IRA is doing anything with the
> > total_costs, apart from using it to determine the preferred class for
> > the allocno.  When the main logic of the algorithms comes into play,
> > we only look at allocno_costs, which do not take into account the
> > future usage of that register in the allocno in region 1.
> >
> > I believe that if the decisions were made based on total_costs then I
> > would get a better allocation with existing IRA.  Before I experiment
> > with this, I was wondering what the motivation is for only looking
> > at allocno_costs?
> >
> > BTW, I did look into using the ! and * constraints, but I don't think
> > they could help.  Our architecture is like Motorola 68k in that we
> > have some instructions that can use all the registers and some that
> > can only use the subset (BOTTOM_REGS).  The latter type can only
> > specify "b" for their constraint, since they can only go in
> > BOTTOM_REGS.  The former type might benefit from "b,!t", so we are
> > more likely to get things in BOTTOM_REGS for when they are later
> > needed there, but the flip-side is that we might also fill up
> > BOTTOM_REGS when actually there was no subsequent need for the value
> > to be in BOTTOM_REGS.  I may have misunderstood how all this works,
> > but it looks like constraints will not help here and, in fact, the
> > total_costs calculations that I mention above are precisely the way
> > IRA passes information about register requirements upwards.
> 
> I've been working on gcc for an ISA (Ubicom32) that seems to have some
> similarities to the problem you're seeing (we have some regs that can
> be used for many things but not all) and was seeing a ton of pointless
> moves introduced by reload.
> 
> I've ended up trying quite a few different strategies, including
> defining different cover classes.  The strategy we're now using is to
> leave the cover classes the same way you have them, but I found that
> the most critical thing was to ensure that REGNO_REG_CLASS was
> returning a minimal class correctly.  Once I had this right then IRA
> started to do the right thing most of the time (-Os still isn't great
> but -O2 usually looks very good now).  We still see some problems but
> they're usually a result of reload messing things up or suboptimal
> handling of auto-incrementing in MEMs.

Thanks for the info, Dave.

Given that C_REGS = r0-r31, BOTTOM_REGS = r0-r15, TOP_CREGS = r16-r31, 
when you say "minimal class", does that mean that I should change my
macro from this:

/* Map a register number to a register class.  */
#define REGNO_REG_CLASS(REGNO)  \
  (BOTTOM_C_REG_P (REGNO) ? BOTTOM_REGS :   \
   (REGNO) == LINK_POINTER_REGNUM ? LINK_REGS : \
   C_REG_P (REGNO) ? C_REGS :   \
   D_REG_P (REGNO) ? D_REGS :   \
   CSR_REG_P (REGNO) ? CSR_REGS :   \
   (unsigned)(REGNO) < FIRST_PSEUDO_REGISTER ? INTERNAL_REGS : ALL_REGS)

to this (see C_REG_P line for change):

/* Map a register number to a register class.  */
#define REGNO_REG_CLASS(REGNO)  \
  (BOTTOM_C_REG_P (REGNO) ? BOTTOM_REGS :   \
   (REGNO) == LINK_POINTER_REGNUM ? LINK_REGS : \
   C_REG_P (REGNO) ? TOP_CREGS :\
   D_REG_P (REGNO) ? D_REGS :   \
   CSR_REG_P (REGNO) ? CSR_REGS :   \
   (unsigned)(REGNO) < FIRST_PSEUDO_REGISTER ? INTERNAL_REGS : ALL_REGS)

I made the change and nothing noticeable happened, but maybe once I fix
whatever else needs fixing then the second version of the macro will be
better?


RE: Understanding IRA

2009-11-10 Thread Ian Bolton
> On 11/06/09 05:53, Dave Hudson wrote:
> >   the most
> > critical thing was to ensure that REGNO_REG_CLASS was returning a
> > minimal class correctly.
> I believe that's been documented as the right thing to do for about 15
> years :-)   So, yes, you definitely want REGNO_REG_CLASS to return the
> smallest class to which a register belongs.
> 
> Jeff

So I should definitely change my function then!  If something isn't
BOTTOM_REGS and it is in C_REGS then the smallest class it's in is
actually TOP_CREGS.

BTW, today I have achieved some good results with existing IRA by doing
the following:

1) Changed the REG_ALLOC_ORDER so that TOP_CREGS are given out before
BOTTOM_REGS.  My previous hacked version worked by increasing the
priority of those that wanted BOTTOM_REGS, so they got first pick; this
new version makes them wait their turn, but ensures those with higher
priority take TOP_CREGS before BOTTOM_REGS.

The analogy, I think, is of giving out meals on an airplane.  Most
people will eat meat or vegetarian meals, whereas vegetarians only
want vegetarian meals.  My hacked version was effectively allowing
the vegetarians to push to the front of the queue and get their meals;
this new version works by encouraging the meat eaters to eat the meaty
meals first, leaving more suitable meals for the vegetarians further
down the plane.
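In target-macro terms, change (1) is just a reordering of REG_ALLOC_ORDER.  A hypothetical sketch, assuming register numbers 0-15 are BOTTOM_REGS and 16-31 are TOP_CREGS (the real port's numbering may differ):

```c
#include <assert.h>

/* Hand out TOP_CREGS (r16-r31) before BOTTOM_REGS (r0-r15): allocnos
   that can live anywhere take the top registers first, leaving the
   bottom registers free for the allocnos that can only use them.  */
#define REG_ALLOC_ORDER                 \
  { 16, 17, 18, 19, 20, 21, 22, 23,     \
    24, 25, 26, 27, 28, 29, 30, 31,     \
     0,  1,  2,  3,  4,  5,  6,  7,     \
     8,  9, 10, 11, 12, 13, 14, 15 }

static const int reg_alloc_order[] = REG_ALLOC_ORDER;
```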

2) I have forced propagation of allocnos to parent regions with the
following hack in find_allocno_class_costs():

{
  /* Propagate costs to upper levels in the region
     tree.  */
  parent_a_num = ALLOCNO_NUM (parent_a);
  for (k = 0; k < cost_classes_num; k++)
    COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k]
      += COSTS_OF_ALLOCNO (total_costs, a_num)->cost[k];
  COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost
    += COSTS_OF_ALLOCNO (total_costs, a_num)->mem_cost;
  /* BEGIN IGB-IRA CHANGE 2 */
  /* Force total_costs to propagate upwards, by setting
     allocno_costs to be total_costs.  */
  for (k = 0; k < cost_classes_num; k++)
    COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->cost[k]
      = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k];
  COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->mem_cost
    = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost;
  /* END IGB-IRA CHANGE 2 */
}
 
I don't know why propagation isn't happening normally, but
this total_costs hack achieves the same thing for me at the
moment.  Is there any information I could provide to help
someone tell me why propagation isn't happening?


RE: Understanding IRA

2009-11-11 Thread Ian Bolton
Yesterday, I wrote:
> BTW, today I have achieved some good results with existing IRA by doing
> the following:
> 
> 1) Changed the REG_ALLOC_ORDER so that TOP_CREGS are given out before
> BOTTOM_REGS.  My previous hacked version worked by increasing the
> priority of those that wanted BOTTOM_REGS, so they got first pick; this
> new version makes them wait their turn, but ensures those with higher
> priority take TOP_CREGS before BOTTOM_REGS.
> 
> The analogy, I think, is of giving out meals on an airplane.  Most
> people will eat meat or vegetarian meals, whereas vegetarians only
> want vegetarian meals.  My hacked version was effectively allowing
> the vegetarians to push to the front of the queue and get their meals;
> this new version works by encouraging the meat eaters to eat the meaty
> meals first, leaving more suitable meals for the vegetarians further
> down the plane.
> 
> 2) I have forced propagation of allocnos to parent regions with the
> following hack in find_allocno_class_costs():
> 
> {
>   /* Propagate costs to upper levels in the region
>  tree.  */
>   parent_a_num = ALLOCNO_NUM (parent_a);
>   for (k = 0; k < cost_classes_num; k++)
> COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k]
>   += COSTS_OF_ALLOCNO (total_costs, a_num)->cost[k];
>   COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost
> += COSTS_OF_ALLOCNO (total_costs, a_num)->mem_cost;
>   /* BEGIN IGB-IRA CHANGE 2 */
>   /* Force total_costs to propagate upwards, by setting
>  allocno_costs to be total_costs */
>   for (k = 0; k < cost_classes_num; k++)
> COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->cost[k]
>   = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k];
> COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->mem_cost
>   = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost;
>   /* END IGB-IRA CHANGE 2 */
> }
> 
> I don't know why propagation isn't happening normally, but
> this total_costs hack achieves the same thing for me at the
> moment.  Is there any information I could provide to help
> someone tell me why propagation isn't happening?

Good news!  I have been able to remove my "total_costs" hack
above by instead commenting out the following line from ira_build()
in ira-build.c:

  remove_unnecessary_regions (false);

For my situation at least, this function is preventing the
propagation of useful allocno information from region 1 to
region 0.  Without my change, the allocation for a pseudo in
region 0 is not aware of how that pseudo will be used inside
a loop in region 1; the real impact of this is that we need
to then do a register move *inside the loop* into a valid
register class for the instruction in region 1.

I do not believe this will impact anyone with a regular
register set, but for any architecture where some instructions
can only use a subset of the register bank, I believe that
all regions are always necessary, since cost information
for each allocno is relevant and important.
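The propagation being relied on here is, schematically, just a per-class accumulation from each child-region allocno into its parent (toy structures below, not GCC's real ones):

```c
#include <assert.h>
#include <stddef.h>

#define N_CLASSES 4

struct toy_allocno
{
  int cost[N_CLASSES];         /* per-register-class allocation cost */
  int mem_cost;                /* cost of living in memory */
  struct toy_allocno *parent;  /* allocno for same pseudo in parent region */
};

/* Fold a child-region allocno's costs into its parent, so that the
   parent-region allocation sees how the pseudo is used inside the
   loop.  */
static void
propagate_costs (struct toy_allocno *a)
{
  if (a->parent == NULL)
    return;
  for (int k = 0; k < N_CLASSES; k++)
    a->parent->cost[k] += a->cost[k];
  a->parent->mem_cost += a->mem_cost;
}
```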

I still need to do some more testing in regards to this "fix",
but I wanted to put my findings out there as soon as possible
for comment from the experts.


RE: Understanding IRA

2009-11-11 Thread Ian Bolton
Vladimir Makarov wrote:
> Ian Bolton wrote:
> > Yesterday, I wrote:
> >
> >> BTW, today I have achieved some good results with existing IRA by
> doing
> >> the following:
> >>
> >> 1) Changed the REG_ALLOC_ORDER so that TOP_CREGS are given out
> before
> >> BOTTOM_REGS.  My previous hacked version worked by increasing the
> >> priority of those that wanted BOTTOM_REGS, so they got first pick;
> this
> >> new version makes them wait their turn, but ensures those with
> higher
> >> priority take TOP_CREGS before BOTTOM_REGS.
> >>
> >> The analogy, I think, is of giving out meals on an airplane.  Most
> >> people will eat meat or vegetarian meals, whereas vegetarians only
> >> want vegetarian meals.  My hacked version was effectively allowing
> >> the vegetarians to push to the front of the queue and get their
> meals;
> >> this new version works by encouraging the meat eaters to eat the
> meaty
> >> meals first, leaving more suitable meals for the vegetarians further
> >> down the plane.
> >>
> >> 2) I have forced propagation of allocnos to parent regions with the
> >> following hack in find_allocno_class_costs():
> >>
> >> {
> >>   /* Propagate costs to upper levels in the region
> >>  tree.  */
> >>   parent_a_num = ALLOCNO_NUM (parent_a);
> >>   for (k = 0; k < cost_classes_num; k++)
> >> COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k]
> >>   += COSTS_OF_ALLOCNO (total_costs, a_num)->cost[k];
> >>   COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost
> >> += COSTS_OF_ALLOCNO (total_costs, a_num)->mem_cost;
> >>   /* BEGIN IGB-IRA CHANGE 2 */
> >>   /* Force total_costs to propagate upwards, by setting
> >>  allocno_costs to be total_costs */
> >>   for (k = 0; k < cost_classes_num; k++)
> >> COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->cost[k]
> >>   = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k];
> >> COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->mem_cost
> >>   = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost;
> >>   /* END IGB-IRA CHANGE 2 */
> >> }
> >>
> >> I don't know why propagation isn't happening normally, but
> >> this total_costs hack achieves the same thing for me at the
> >> moment.  Is there any information I could provide to help
> >> someone tell me why propagation isn't happening?
> >>
> >
> > Good news!  I have been able to remove my "total_costs" hack
> > above by instead commenting out the following line from ira_build()
> > in ira-build.c:
> >
> >   remove_unnecessary_regions (false);
> >
> > For my situation at least, this function is preventing the
> > propagation of useful allocno information from region 1 to
> > region 0.  Without my change, the allocation for a pseudo in
> > region 0 is not aware of how that pseudo will be used inside
> > a loop in region 1; the real impact of this is that we need
> > to then do a register move *inside the loop* into a valid
> > register class for the instruction in region 1.
> >
> > I do not believe this will impact anyone with a regular
> > register set, but for any architecture where some instructions
> > can only use a subset of the register bank, I believe that
> > all regions are always necessary, since cost information
> > for each allocno is relevant and important.
> >
> > I still need to do some more testing in regards this "fix",
> > but I wanted to put my findings out there as soon as possible
> > for comment from the experts.
> >
> The function (remove_unnecessary_regions) is used to increase RA speed
> by decreasing number of regions.  The region is removed if the register
> pressure is not high in the region in other words if the probability to
> spill pseudos on the region border to improve RA in the region is
> small.
> 
> But when the region is removed and correspondingly allocnos (see
> function remove_unecessary_allocnos) in the region the info including
> hard register costs is propagated to the  corresponding allocnos in the
> parent region (see function propagate_some_info_from_allocno).
> 
> I've just checked the code it looks ok to me.  Ian, it would be nice if
> you figure out why the propagation does not happen.   As Jeff wrote,
> fixing that would be important practically for any target.

(I'm in the middle of a hefty compile at the moment, so I can't

RE: Understanding IRA

2009-11-11 Thread Ian Bolton

RE: Understanding IRA

2009-11-16 Thread Ian Bolton

RE: Understanding IRA

2009-11-19 Thread Ian Bolton
Jeff Law wrote: 
> On 11/16/09 10:33, Ian Bolton wrote:
> > The question is: how to fix this?  I have initially put the
> REG_ALLOC_ORDER
> > back to how it was and changed the operand constraints in our MD
> file,
> > so each of the "apathetic" instructions, will either take a 't'
> (TOP_CREG)
> > or '?b' (BOTTOM_REG).  The '?' shows that this alternative is
> slightly more
> > costly than using 't'.  On the benchmark that benefitted the most
> from
> > the new REG_ALLOC_ORDER, these constraints are almost achieving the
> same
> > thing.  It is only "almost" there because I am struggling with how to
> show
> > two alternatives for loads and stores, which currently have an 'm'
> > constraint.
> >
> I'm not aware of any way to describe this to IRA.  In theory I guess
> IRA
> could be twiddled to use  TARGET_ADDRESS_COST to derive some kind of
> cost difference based on the registers used in the MEM, but it seems
> rather hackish.

I found somewhere in record_address_reg to achieve what I needed:

  for (k = 0; k < cost_classes_num; k++)
    {
      i = cost_classes[k];
      pp->cost[k]
        += (ira_get_may_move_cost (Pmode, i, rclass, true) * scale) / 2;

      /* Slightly nudge memory addresses away from using BOTTOM_REGS and
         C_REGS, so they take TOP_CREGS instead - should this pseudo
         later need BOTTOM_REGS, there will be a higher cost to use
         TOP_CREGS and it will still get BOTTOM_REGS.  This is
         equivalent to adding a ?b on each instruction that currently
         has an 'm' constraint.

         Writing this generically might look something like:

         pp->cost[k] += TARGET_ADDRESS_EXTRA_COST_P (cost_classes[k])
                        ? (scale / 2) : 0;  */
      if (cost_classes[k] == BOTTOM_REGS || cost_classes[k] == C_REGS)
        pp->cost[k] += scale / 2;
    }

I was then able to alter all our register-agnostic instructions in our
.md file to take either a 't' for TOP_CREGS or a '?b' for BOTTOM_REGS.
Initial results showed that IRA was moving input arguments out of their
BOTTOM_REGS (e.g. $c1) into TOP_CREGS to do work on them, since it
thought TOP_CREGS were less costly to use, despite the cost of the move
instruction to get the input argument into a TOP_CREG.

I wondered if this was because we add 2 to the alt_cost for each '?' and
our REG_MOVE_COST is also 2, but this becomes irrelevant anyway if you
do a lot of work with the input argument, since each potential use in a
BOTTOM_REG incurs some kind of penalty (one per '?' seen) which will
eventually persuade IRA that leaving the input argument in the
BOTTOM_REG it arrived in is more costly than moving it to a TOP_CREG.

I addressed this problem by splitting my register bank a little
differently: instead of making a distinction between BOTTOM_REGS and
TOP_CREGS, I made it so there was only a penalty if you used one of the
non-argument BOTTOM_REGS (i.e. a callee-save BOTTOM_REG).  This meant
that IRA was happy to leave input arguments in their BOTTOM_REGS but
erred towards using TOP_CREGS once the caller-save BOTTOM_REGS had run
out.  This was an improvement, but there was still a case where these
'?' penalties were not aligned with reality:

T1 = A + B; // can use any register, TOP_CREGS appears cheaper
T2 = A - C; // can use any register, TOP_CREGS appears cheaper
T3 = A & D; // must use BOTTOM_REGS

The constraints for the first two instructions show that TOP_CREGS is
cheaper, but then you have to plant a move to get A into a BOTTOM_REG
to do the AND; in reality, we know it cheaper to have A in a BOTTOM_REG
all along, but the '?' constraint suggests there is a cost in doing this
for the ADD and SUB and so IRA will put A in a TOP_CREG at first and
incur the cost of the move because it is still cheaper than the costs I
have defined in with my constraints.  I don't believe there is a way to
communicate a conditional cost, so I'm thinking that constraints are not
the solution for me at this time.  What are your thoughts?
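To make the objection concrete, here is a toy version of the cost
accumulation; the penalty and move-cost values of 2 come from the
discussion above, but everything else (the enum, the function names)
is a hypothetical sketch rather than IRA's real ira-costs.c code:

```c
#include <assert.h>

enum { TOP_CREGS = 0, BOTTOM_REGS = 1, N_CLASSES = 2 };

/* Accumulated per-class costs for one pseudo, built up by summing
   over every insn the pseudo appears in.  */
struct costs { int cost[N_CLASSES]; };

/* An agnostic insn (ADD, SUB): the '?b' alternative charges a fixed
   penalty (2 per '?') against BOTTOM_REGS, unconditionally.  */
static void record_agnostic_use (struct costs *c)
{
  c->cost[BOTTOM_REGS] += 2;   /* '?' penalty */
}

/* A restricted insn (AND): only BOTTOM_REGS is legal, so TOP_CREGS is
   charged one register-register move (REG_MOVE_COST == 2 here).  */
static void record_bottom_only_use (struct costs *c)
{
  c->cost[TOP_CREGS] += 2;     /* move into a BOTTOM_REG first */
}

static int cheapest_class (const struct costs *c)
{
  return c->cost[TOP_CREGS] <= c->cost[BOTTOM_REGS]
         ? TOP_CREGS : BOTTOM_REGS;
}

/* Pseudo A from the example: T1 = A + B; T2 = A - C; T3 = A & D;  */
static int chosen_class_for_A (void)
{
  struct costs a = { { 0, 0 } };
  record_agnostic_use (&a);      /* ADD */
  record_agnostic_use (&a);      /* SUB */
  record_bottom_only_use (&a);   /* AND */
  /* BOTTOM_REGS now costs 4 and TOP_CREGS only 2, so the
     unconditional '?' penalties make the allocator prefer TOP_CREGS
     and eat the move before the AND, even though keeping A in a
     BOTTOM_REG throughout would have cost nothing.  */
  return cheapest_class (&a);
}
```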

> You might try something like this:
> 
>1. Crank up the callee-saved register cost adjustment in
> assign_hard_reg so that it's scaled based on REG_FREQ.  That will
> probably lead to some regressions based on my experiments.
> 
>2. Then add a check that completely avoids the cost adjustment in
> cases where we pushed a MAY_SPILL_P allocno.  This was on my todo list,
> but I haven't got to it yet.
> 
> If you wanted to get fancy, you could track the maximum number of
> neighbors in each class as allocnos are pushed and use that to adjust
> how many registers are cost adjusted in assign_hard_reg.  The idea
> being
> the more neighbors the allocno has, the  more callee-saved regsiters
> we

Worth balancing the tree before scheduling?

2009-11-20 Thread Ian Bolton
From some simple experiments (see below), it appears as though GCC aims
to create a lop-sided tree when there are constants involved (func1
below), but a balanced tree when there aren't (func2 below).

Our assumption is that GCC likes having constants all near to each other
to aid with tree-based optimisations, but I'm fairly sure that, when it
comes to scheduling, it would be better to have a balanced tree, so
sched has more choices about what to schedule next?

The impact of limiting sched's options can be seen if we look at the
pseudo-assembly produced by GCC for our architecture:

func1:
LSHIFT  $c3,$c1,3 # tmp137, a,
ADD $c2,$c2,$c3   # tmp138, b, tmp137
ADD $c1,$c2,$c1   #, tmp138, a

We think it would be better to avoid using the temporary:

func1:
ADD $c2,$c1,$c2 # tmp137, a, b
LSHIFT  $c1,$c1,3   # tmp138, a,
ADD $c1,$c2,$c1 # , tmp137, tmp138

As it currently stands, sched doesn't have the option to do this because
its input (shown in func.c.172r.asmcons below) is arranged such that the
first add depends on the shift and the second add depends on the first
add.

If the tree were balanced, sched would have the option to do the add
first.
And, providing the logic was there in sched, we could make it choose to
schedule such that we limit the number of temporaries used.

Maybe one of the RTL passes prior to scheduling has the potential to
balance the tree/RTL, but it just isn't enabled on our architecture?
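For reference, the two association orders compute the same value,
which is what would make such a balancing (reassociation) pass legal;
the function names below are just for illustration:

```c
#include <assert.h>

/* Left-leaning association, as currently emitted for func1:
   (b + (a << 3)) + a -- each add depends on the previous result, so
   the scheduler has no freedom in ordering.  */
static int func1_lopsided (int a, int b)
{
  int t137 = a << 3;
  int t138 = b + t137;
  return t138 + a;
}

/* Balanced association: the ADD and the LSHIFT are independent, so
   the scheduler may issue either first and can often avoid a third
   temporary register.  */
static int func1_balanced (int a, int b)
{
  int t1 = a + b;     /* independent of the shift */
  int t2 = a << 3;    /* independent of the add   */
  return t1 + t2;
}
```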

==
func.c:
--
int func1 (int a, int b)
{
  /* the original expression */
  return a + b + (a << 3);
}
 

int func2 (int a, int b, int c)
{
  /* the original expression */
  return a + b + (a << c);
}
 

==

==
func.c.129t.supress_extend:
--
;; Function func1 (func1)
 
func1 (int a, int b)
{
:
  return (b + (a << 3)) + a;
}

func2 (int a, int b, int c)
{
:
  return (b + a) + (a << c);
}

 
==

==
func.c.172r.asmcons:
--

;; Function func1 (func1)

;; Pred edge  ENTRY [100.0%]  (fallthru)
(note 5 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
 
(insn 2 5 3 2 func.c:2 (set (reg/v:SI 134 [ a ])
(reg:SI 1 $c1 [ a ])) 45 {*movsi} (expr_list:REG_DEAD (reg:SI 1
$c1 [ a ])
(nil)))
 
(note 3 2 4 2 NOTE_INSN_DELETED)
 
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
 
(insn 7 4 8 2 func.c:2 (set (reg:SI 137)
(ashift:SI (reg/v:SI 134 [ a ])
(const_int 3 [0x3]))) 80 {ashlsi3} (nil))
 
(insn 8 7 9 2 func.c:2 (set (reg:SI 138)
(plus:SI (reg:SI 2 $c2 [ b ])
(reg:SI 137))) 65 {*addsi3} (expr_list:REG_DEAD (reg:SI 137)
(expr_list:REG_DEAD (reg:SI 2 $c2 [ b ])
(nil
 

(note 9 8 14 2 NOTE_INSN_DELETED)
 

(insn 14 9 20 2 func.c:5 (set (reg/i:SI 1 $c1)
(plus:SI (reg:SI 138)
(reg/v:SI 134 [ a ]))) 65 {*addsi3} (expr_list:REG_DEAD
(reg:SI 138)
(expr_list:REG_DEAD (reg/v:SI 134 [ a ])
(nil
 

(insn 20 14 0 2 func.c:5 (use (reg/i:SI 1 $c1)) -1 (nil))

;; Function func2 (func2)

;; Pred edge  ENTRY [100.0%]  (fallthru)
(note 6 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
 

(insn 2 6 3 2 func.c:8 (set (reg/v:SI 134 [ a ])
(reg:SI 1 $c1 [ a ])) 45 {*movsi} (expr_list:REG_DEAD (reg:SI 1
$c1 [ a ])
(nil)))
 

(note 3 2 4 2 NOTE_INSN_DELETED)
 

(note 4 3 5 2 NOTE_INSN_DELETED)
 

(note 5 4 8 2 NOTE_INSN_FUNCTION_BEG)
 

(insn 8 5 9 2 func.c:8 (set (reg:SI 138)
(plus:SI (reg:SI 2 $c2 [ b ])
(reg/v:SI 134 [ a ]))) 65 {*addsi3} (expr_list:REG_DEAD
(reg:SI 2 $c2 [ b ])
(nil)))
 

(insn 9 8 10 2 func.c:8 (set (reg:SI 139)
(ashift:SI (reg/v:SI 134 [ a ])
(reg:SI 3 $c3 [ c ]))) 80 {ashlsi3} (expr_list:REG_DEAD
(reg/v:SI 134 [ a ])
(expr_list:REG_DEAD (reg:SI 3 $c3 [ c ])
(nil
 

(note 10 9 15 2 NOTE_INSN_DELETED)
 

(insn 15 10 21 2 func.c:11 (set (reg/i:SI 1 $c1)
(plus:SI (reg:SI 138)
(reg:SI 139))) 65 {*addsi3} (expr_list:REG_DEAD (reg:SI 139)
(expr_list:REG_DEAD (reg:SI 138)
(nil
 

(insn 21 15 0 2 func.c:11 (use (reg/i:SI 1 $c1)) -1 (nil))

==

Cheers,
Ian


RE: Worth balancing the tree before scheduling?

2009-11-23 Thread Ian Bolton
David Edelsohn wrote: 
> On Fri, Nov 20, 2009 at 10:05 AM, Ian Bolton 
> wrote:
> > From some simple experiments (see below), it appears as though GCC
> aims
> > to
> > create a lop-sided tree when there are constants involved (func1
> below),
> > but a balanced tree when there aren't (func2 below).
> >
> > Our assumption is that GCC likes having constants all near to each
> other
> > to
> > aid with tree-based optimisations, but I'm fairly sure that, when it
> > comes
> > to scheduling, it would be better to have a balanced tree, so sched
> has
> > more
> > choices about what to schedule next?
> 
> I think this would depend on the target architecture and instruction
> set: CISC vs RISC, many registers vs few registers, etc.  I do not
> believe that GCC intentionally is trying to optimize for either, but I
> do not think there is a single, right answer.
>
Regardless of the architecture, I can't see how an unbalanced tree would
ever be a good thing.  With a balanced tree, you can still choose to
process it in either direction (broad versus deep) - whichever is better
for your architecture - but, as far as I can see (bearing in mind that
I'm very new to GCC development!), a tall lop-sided tree gives few
scheduling options due to all the extra dependencies.  I guess I must
be missing something?

Regards,
Ian


RE: Understanding IRA

2009-12-02 Thread Ian Bolton
Jeff Law wrote:
> Ian Bolton wrote: 
> > Initial results showed that IRA was moving input arguments out of
> their
> > BOTTOM_REGS (e.g. $c1) into TOP_CREGS to do work on them, since it
> > thought TOP_CREGS were less costly to use, despite the cost of the
> move
> > instruction to get the input argument into a TOP_CREG.
> >
> That may indicate a cost scaling issue or more general weaknesses in
> IRA's cost modeling.
> 
> > I addressed this problem by splitting my register bank a little
> > differently: instead of making a distinction between BOTTOM_REGS and
> > TOP_CREGS, I made it so there was only a penalty if you used one of
> the
> > non-argument BOTTOM_REGS (i.e. a callee-save BOTTOM_REG).  This meant
> > that IRA was happy to leave input arguments in their BOTTOM_REGS but
> > erred towards using TOP_CREGS once the caller-save BOTTOM_REGS had
> run
> > out.  This was an improvement, but there was still a case where these
> > '?' penalties were not aligned with reality:
> >
> > T1 = A + B; // can use any register, TOP_CREGS appears cheaper
> > T2 = A - C; // can use any register, TOP_CREGS appears cheaper
> > T3 = A&  D; // must use BOTTOM_REGS
> >
> > The constraints for the first two instructions show that TOP_CREGS is
> > cheaper, but then you have to plant a move to get A into a BOTTOM_REG
> > to do the AND; in reality, we know it cheaper to have A in a
> BOTTOM_REG
> > all along, but the '?' constraint suggests there is a cost in doing
> this
> > for the ADD and SUB and so IRA will put A in a TOP_CREG at first and
> > incur the cost of the move because it is still cheaper than the costs
> I
> > have defined in with my constraints.  I don't believe there is a way
> to
> > communicate a conditional cost, so I'm thinking that constraints are
> not
> > the solution for me at this time.  What are your thoughts?
> >
> See above.  This might be a problem with scaling or weaknesses in IRA's
> costing model.   In theory IRA accumulates the cost of using each class
> over the set of insns using a particular pseudo.

I had an epiphany this morning and came up with an idea to achieve the
lookahead I thought I needed, thereby making the costs created by '?' a
lot more concrete and reliable.

Firstly, I have altered the alt_cost adjustment (for '?') in ira-costs.c,
so that it only happens on the second pass *and* only when the first pass
has determined that this allocno never needs a BOTTOM_REG.

The first condition (that it only occurs on the second pass) is there so
that the preferred class calculated for an allocno is based on hard
constraints, as opposed to the fuzzier constraints of '?'.  Without this
change, the second pass cannot use the preferred class to correctly add
costs for each class that doesn't intersect with the preferred class.

e.g. If an insn has an allocno as an operand that requires BOTTOM_REGS,
then we want the cost for TOP_CREGS for that allocno in that operand to
be higher to show the cost of moving it into BOTTOM_REGS.  But if we let
the '?' constraints dictate the pref class, and this allocno appears in
other insns where the '?' constraint has appeared, then TOP_CREGS may
end up being the preferred class and so this insn, which actually needs
BOTTOM_REGS for its operand, will end up increasing the costs for any
class that doesn't intersect with TOP_CREGS (i.e. BOTTOM_REGS).

I'm thinking that this change will be generally useful.  What are your
thoughts?

The second condition is determined by me storing a new variable in each
allocno on the first pass to flag whether it ever appears as an operand
that must have BOTTOM_REGS.  On the second pass, I can then only penalise
an allocno for using my precious BOTTOM_REGS if I have already determined
that it will never need them.
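A minimal sketch of the two-pass gating described above; the field and
function names are hypothetical stand-ins, since the real change lives
in ira-costs.c:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-allocno state for the two-pass costing idea.  */
struct toy_allocno
{
  bool ever_needs_bottom;  /* set on pass 1 when a BOTTOM_REGS-only
                              operand constraint is seen */
  int bottom_cost;         /* accumulated cost of BOTTOM_REGS */
};

/* Pass 1: record hard constraints only; '?' penalties are skipped so
   the preferred class reflects what the insns actually require.  */
static void pass1_record (struct toy_allocno *a, bool requires_bottom)
{
  if (requires_bottom)
    a->ever_needs_bottom = true;
}

/* Pass 2: apply the '?' penalty only to allocnos that pass 1 proved
   will never need BOTTOM_REGS, so pseudos headed for an AND-style
   insn are not pushed away from the registers they must end up in.  */
static void pass2_apply_question_penalty (struct toy_allocno *a)
{
  if (!a->ever_needs_bottom)
    a->bottom_cost += 2;   /* '?' alt_cost */
}

/* Run both passes for a pseudo and report the BOTTOM_REGS penalty it
   accumulated.  */
static int penalty_after_two_passes (bool requires_bottom)
{
  struct toy_allocno a = { false, 0 };
  pass1_record (&a, requires_bottom);
  pass2_apply_question_penalty (&a);
  return a.bottom_cost;
}
```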

This change is probably too specific to our case at the moment, but it
could probably be made generic, if it has use to other architectures.

These two changes together make the example I created above (where A
appears in multiple operations) correctly allocate a BOTTOM_REG from the
outset.  I am yet to run benchmarks with this solution, but I think it
will end up being similar to my alternate REG_ALLOC_ORDER, where I just
gave out TOP_CREGS first.

Sadly, I don't think either approach will handle the case where there is
low pressure and we first grab a TOP_CREG for something like an ADD (due
to our constraints telling us to) then we free up that register cos it's
not needed anymore and then we have to use a different (BOTTOM_REG) reg
for something like an AND; we should have just used one BOTTOM_REG.

I guess solving this would involve bigger changes to the algorithm, so
that ea

RE: Understanding IRA

2009-12-07 Thread Ian Bolton
Jeff Law wrote:
> >
> > I had an epiphany this morning and came up with an idea to achieve
> the
> > lookahead I thought I needed, thereby making the costs created by '?'
> a
> > lot more concrete and reliable.
> >
> > Firstly, I have altered the alt_cost adjustment (for '?') in ira-
> costs.c,
> > so that it only happens on the second pass *and* only when the first
> pass
> > has determined that this allocno never needs a BOTTOM_REG.
> >
> > The first condition (that it only occurs on the second pass) is there
> so
> > that the preferred class calculated for an allocno is based on hard
> > constraints, as opposed to the fuzzier constraints of '?'.  Without
> this
> > change, the second pass cannot use the preferred class to correctly
> add
> > costs for each class that doesn't intersect with the preferred class.
> >
> > e.g. If an insn has an allocno as an operand that requires
> BOTTOM_REGS,
> > then we want the cost for TOP_CREGS for that allocno in that operand
> to
> > be higher to show the cost of moving it into BOTTOM_REGS.  But if we
> let
> > the '?' constraints dictate the pref class, and this allocno appears
> in
> > other insns where the '?' constraint has appeared, then TOP_CREGS may
> > end up being the preferred class and so this insn, which actually
> needs
> > BOTTOM_REGS for its operand, will end increasing the costs for any
> class
> > that doesn't intersect with TOP_CREGS (i.e. BOTTOM_REGS).
> >
> > I'm thinking that this change will be generally useful.  What are
> your
> > thoughts?
> >
> Step #1 seems a lot like adding '*' before the '?'.   The idea behind
> '*' was to make it possible to ignore a constraint modifier such as '?'
> or '!' when choosing register preferences.   I haven't looked at IRA's
> implementation of preferencing, but I suspect it mostly lifted the old
> allocator's preferencing code, so I'd expect we're still supporting
> '*'.

I just had a look through the code and '*' means the '?' is completely
ignored during costing.  A quick grep shows me that '?' would then only
have meaning in reload.c, reload1.c and postreload.c.

This is the code I have from record_reg_classes() in ira-costs.c:

case '*':
  /* Ignore the next letter for this pass.  */
  c = *++p;
  break;

The comment suggests it will be ignored in "this pass", but the code
makes it get ignored in both of the ira-costs.c passes.  Maybe this is
a bug?  Perhaps it should read:

case '*':
  /* Ignore the next letter for this pass.  */
  if (!allocno_pref)
c = *++p;
  break;

If I changed all my '?' constraints to be '*?' then the above
change would achieve what my hack does.

> > The second condition is determined by me storing a new variable in
> each
> > allocno on the first pass to flag whether it ever appears as an
> operand
> > that must have BOTTOM_REGS.  On the second pass, I can then only
> penalise
> > an allocno for using my precious BOTTOM_REGS if I have already
> determined
> > that it will never need them.
> >
> > This change is probably too specific to our case at the moment, but
> it
> > could probably be made generic, if it has use to other architectures.
> >
> I think we already have code to do something like this via
> CONFLICT_HARD_REGNO_COSTS.  Though I must admit, I haven't entirely
> wrapped my head around those costs and how they're used yet.

I was hoping I would be able to avoid all the conflict-related stuff, but
this may be where I have to go next, although I am considering returning
to my elegant solution of just giving out TOP_REGS first unless someone
specifically needs BOTTOM_REGS.  This does perform slightly worse for low
reg pressure cases (due to the case I mention below), but is the best
solution I have come up with for the high pressure cases.

> > Sadly, I don't think either approach will handle the case where there
> is
> > low pressure and we first grab a TOP_CREG for something like an ADD
> (due
> > to our constraints telling us to) then we free up that register cos
> it's
> > not needed anymore and then we have to use a different (BOTTOM_REG)
> reg
> > for something like an AND; we should have just used just one
> BOTTOM_REG.
> >
> That would actually be something helped by the patch we were discussing
> in a prior message.  The only trick is the two allocnos would have to
> be
> connected by conflicting with some 3rd allocno.  The 3rd allocno may
> not
> be a high degree, but the heuristic will still trigger and slightly
> encourage the ADD and AND allocnos to share the same reg.

Yeah, but how do I ensure that the one that needs the BOTTOM_REG gets
colored first, thereby leaving its register free to be re-used?  This
little tweak only kicks in when you are coloring the first of the non-
conflicting pair, right?  If the agnostic allocno is colored first, it
will grab the TOP_REG, thereby causing your tweak to slightly decrem

RE: Understanding IRA (The Story so far)

2009-12-14 Thread Ian Bolton
For those who have been following my adventure in the world of IRA, or
just want to take advantage of some of the time I've spent on this,
here's a detailed recap of my exploits and my current status (which I'm
very happy with):

I initially thought IRA was calculating costs wrongly for our
architecture and so I experimented with adding more costs in different
places.  This effectively raised the priority of coloring certain
allocnos and it led to good performance on some benchmarks, but it was
really a complete hack and not an acceptable solution, so I went back
to the drawing board.

After chatting with Vlad and Jeff, I realised IRA was calculating costs
correctly and the issue was that the costs only reflect the preferences
of each allocno, as opposed to their "agnosticism" (i.e. not caring
what register they get), and so instructions that will happily take any
register would take whichever one you showed them first; this meant
that high-priority agnostic instructions (that are colored first) were
pinching BOTTOM_REGS before the lower-priority special instructions got
a chance.  To address this, I came up with the simple solution of
changing the REG_ALLOC_ORDER, where I show TOP_CREGS (the top half of
our register set) to agnostic instructions before the callee-save
BOTTOM_REGS (c7-c15) by default, so they take TOP_CREGS when all other
costs are equal; meanwhile, allocnos requiring BOTTOM_REGS were given
them because they'd been left unallocated and their costs were lower
than the TOP_CREGS.

This resulted in good performance improvements on some key benchmarks,
where previously agnostic instructions were unwittingly using up these
special BOTTOM_REGS and leading to us allocating TOP_CREGS for operands
in special instructions (which then required spills and moves to get a
BOTTOM_REG in time to do the special instruction).  The main drawback
of this alternate REG_ALLOC_ORDER was that we reduce the chance to
reuse a callee-save register if we pick a TOP_CREG, because it doesn't
work on all instructions, and I ended up getting regressions on some
low-pressure benchmarks because we used more callee-save registers than
previously.

I then tried various other ways of achieving the performance
improvements I
got on the high register pressure benchmarks and eliminating the
regressions
on the low-pressure ones:

* I tried using '?' constraints on agnostic instructions, such that they
erred to TOP_CREGS instead of callee-save BOTTOM_REGS, but this led to
too
much bias away from the more reusable BOTTOM_REGS (even when there was
no
need to avoid them) and I subsequently got regressions;

* I tried restricting the effect of the '?', by not using it for the
register
preferencing pass and by only adding on the alt_cost if I'd determined
the
allocno never requires BOTTOM_REG, and this got some decent performance,
but
I was still effectively penalising agnostic instructions for using
BOTTOM_REGS regardless of whether other allocnos were demanding them,
so I
got regressions;

* I tried coloring the ones that need BOTTOM_REGS first, which maximises
the
reuse opportunities, but this was essentially a rejection of the
existing
tried-and-tested coloring heuristics in IRA and consequently led to some
regressions;

* I tried combining the new REG_ALLOC_ORDER with a modified coloring
heuristic that uses the need for BOTTOM_REGS as a tie-breaker when two
allocnos are otherwise indistinguishable, but this was overriding the
ALLOCNO_NUM ordering (which I guess represents live range starting
points?),
so I got some regressions.

Fortunately, I was hit by another epiphany on Monday night and I
realised I
could emulate the new REG_ALLOC_ORDER dynamically within assign_hard_reg
and
only when necessary.  When allocating for an agnostic allocno, if it
conflicted with other allocnos that needed BOTTOM_REGS, I added 1 to the
full_cost of each of the BOTTOM_REGS so this allocno took a TOP_CREG.
This
gave me my best performance overall, but there were still some
regressions on
low-pressure benchmarks.  Whilst investigating them, I realised I had
gone
beyond my original new REG_ALLOC_ORDER because I was now avoiding all
BOTTOM_REGS instead of just the callee-save BOTTOM_REGS, so I changed
assign_hard_reg so it only adds 1 to each callee-save BOTTOM_REG.  This
means
we happily use up the caller-save BOTTOM_REGS (c1 - c6), regardless of
whether the allocno is agnostic or not, and this leads to more
reusability in
low-pressure cases (whereas avoiding all BOTTOM_REGS for an agnostic
instruction decreases reusability).

Note that I had been experimenting with both the Priority and
Chaitin-Briggs algorithms up until this point; I have decided to stick
with Priority for now because we are already using it, which minimises
disruption and reduces the risk of regressions on untested inputs.
(Results generally show CB to perform better and produce smaller code,
but it suffers badly in very high-pressure cases and spills more than
Priority.)


Possible IRA improvements for irregular register architectures

2009-12-18 Thread Ian Bolton
Let's assume I have two sub-classes of ALL_REGS: BOTTOM_REGS (c0-c15)
and TOP_CREGS (c16-c31).

Let's also assume I have two main types of instruction: A-type
Instructions, which can use ALL 32 registers, and B-type Instructions,
which can only use the 16 BOTTOM_REGS.

IRA will correctly calculate costs (in ira-costs.c) for allocnos
appearing in B-type instructions, such that TOP_CREGS has a higher
cost than BOTTOM_REGS.  It will also calculate costs for the A-type
instructions such that TOP_CREGS and BOTTOM_REGS have the same cost.

The order of coloring will be determined by the algorithm chosen:
Priority or Chaitin-Briggs.  As each allocno is colored, the costs
will be inspected and the best available hard register will be chosen,
mainly based on the register class costs mentioned above, so allocnos
in B-type Instructions will usually be assigned a BOTTOM_REG if one is
free.  If two or more hard registers share the same cost then
whichever one appears first in the REG_ALLOC_ORDER will be assigned.
(If no hard register can be found, the allocno is assigned memory and
will require a "reload" in a later pass to get a hard register.)

I do not wish to alter the coloring algorithms or the coloring order.
I believe they are good at determining the order to color allocnos,
which dictates the likelihood of being assigned a hard register.  What
I wish to change is the hard register that is assigned, given that the
coloring order has determined that this allocno should get one next.

Why do I need to do this?  Since the BOTTOM_REGS can be used by all
instructions, it makes sense to put them first in the REG_ALLOC_ORDER,
so we minimise the number of registers consumed by a low-pressure
function.  But it also makes sense, in high-pressure functions, to
steer A-type Instructions away from using BOTTOM_REGS so that they are
free for B-type Instructions to use.
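
A minimal sketch of what that default ordering might look like for this
hypothetical 32-register machine (the exact numbering is an assumption
for illustration, not taken from the real port):

```c
/* BOTTOM_REGS (c0-c15) first, so a low-pressure function consumes only
   registers that every instruction can use; TOP_CREGS (c16-c31) follow.
   The trade-off discussed above is that, under high pressure, A-type
   instructions may then pinch BOTTOM_REGS needed by B-type ones.  */
#define REG_ALLOC_ORDER                                             \
  {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, \
    16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }
```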

To achieve this, I tried altering the costs calculated in ira-costs.c,
either explicitly with various hacks or by altering operand
constraints.  The problem with this approach was that it is static and
independent, occurring before any coloring order has been determined
and without any awareness of the needs of other allocnos.  I believe I
require a dynamic way to alter the costs, based on which allocnos
conflict with the allocno currently being colored and which hard
registers are still available at this point.

The patch I have attached here is my first reasonably successful
attempt at this dynamic approach, which has led to performance
improvements on some of our benchmarks and no significant
regressions.

I am hoping it will be useful to others, but I post it more as a
talking point or perhaps to inspire others to come up with better
solutions and share them with me :-)


bolton_ira_subset_experiments.diff
Description: bolton_ira_subset_experiments.diff


RE: Possible IRA improvements for irregular register architectures

2010-01-04 Thread Ian Bolton
Happy New Year!

I was hoping for some kind of response to this, but maybe I didn't give
enough info?  I'd appreciate some pointers on what I could do to prompt
some discussion because I have some promising new ideas that lead on
from what I've described below.

Cheers,
Ian

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf
Of
> Ian Bolton
> Sent: 18 December 2009 15:34
> To: gcc@gcc.gnu.org
> Subject: Possible IRA improvements for irregular register
architectures
> 
> Let's assume I have two sub-classes of ALL_REGS: BOTTOM_REGS (c0-c15)
> and TOP_CREGS (c16-c31).
> 
> Let's also assume I have two main types of instruction: A-type
> Instructions, which can use ALL 32 registers, and B-type Instructions,
> which can only use the 16 BOTTOM_REGS.
> 
> IRA will correctly calculate costs (in ira-costs.c) for allocnos
> appearing in B-type instructions, such that TOP_CREGS has a higher
> cost than BOTTOM_REGS.  It will also calculate costs for the A-type
> instructions such that TOP_CREGS and BOTTOM_REGS have the same cost.
> 
> The order of coloring will be determined by the algorithm chosen:
> Priority or Chaitin-Briggs.  As each allocno is colored, the costs
> will be inspected and the best available hard register will be chosen,
> mainly based on the register class costs mentioned above, so allocnos
> in B-type Instructions will usually be assigned a BOTTOM_REG if one is
> free.  If two or more hard registers share the same cost then
> whichever one appears first in the REG_ALLOC_ORDER will be assigned.
> (If no hard register can be found, the allocno is assigned memory and
> will require a "reload" in a later pass to get a hard register.)
> 
> I do not wish to alter the coloring algorithms or the coloring order.
> I believe they are good at determining the order to color allocnos,
> which dictates the likelihood of being assigned a hard register.  What
> I wish to change is the hard register that is assigned, given that the
> coloring order has determined that this allocno should get one next.
> 
> Why do I need to do this?  Since the BOTTOM_REGS can be used by all
> instructions, it makes sense to put them first in the REG_ALLOC_ORDER,
> so we minimise the number of registers consumed by a low-pressure
> function.  But it also makes sense, in high-pressure functions, to
> steer A-type Instructions away from using BOTTOM_REGS so that they are
> free for B-type Instructions to use.
> 
> To achieve this, I tried altering the costs calculated in ira-costs.c,
> either explicitly with various hacks or by altering operand
> constraints.  The problem with this approach was that it is static and
> independent, occurring before any coloring order has been determined
> and without any awareness of the needs of other allocnos.  I believe I
> require a dynamic way to alter the costs, based on which allocnos
> conflict with the allocno currently being colored and which hard
> registers are still available at this point.
> 
> The patch I have attached here is my first reasonably successful
> attempt at this dynamic approach, which has led to performance
> improvements on some of our benchmarks and no significant
> regressions.
> 
> I am hoping it will be useful to others, but I post it more as a
> talking point or perhaps to inspire others to come up with better
> solutions and share them with me :-)


RE: Possible IRA improvements for irregular register architectures

2010-01-11 Thread Ian Bolton
In the absence of any discussion, I just went ahead and implemented my
new ideas, which are attached as a patch!

Whilst it is tailored to the specifics of our architecture - where we
have 32 C_REGS, of which there are 16 BOTTOM_REGS, and some insns
that only work with BOTTOM_REGS - I believe it could be adapted to
other architectures where you want to optimise the allocation of
any special subsets of your register set.

The key file to look at is ira-color.c.  I have something called
"determine_max_abcd", which considers all conflicting allocnos for the
one currently being colored, and finds the maximum depth of overlap
of those yet-to-be-colored allocnos that require BOTTOM_REGS and those
already-colored allocnos that have already been assigned a BOTTOM_REG.

I use this to tell me whether to give out a BOTTOM_REG to an allocno
that can use any of our 32 registers.  If the ABCD is greater than or
equal to the number of BOTTOM_REGS then I adjust the costs of the
BOTTOM_REGS to nudge this allocno towards using a TOP_CREG instead.
Without this intervention, a future allocno that needed BOTTOM_REGS
would not have got one and we would need to plant a spill and fill
later to satisfy the needs of its instruction.

This improves performance on all benchmarks I have run on our
architecture.  I still have other ideas for making it even better,
but I figured now was a good time to share what I have.

Comments and questions would be greatly appreciated.

Cheers,
Ian

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf
Of
> Ian Bolton
> Sent: 04 January 2010 14:19
> To: gcc@gcc.gnu.org
> Subject: RE: Possible IRA improvements for irregular register
> architectures
> 
> Happy New Year!
> 
> I was hoping for some kind of response to this, but maybe I didn't
give
> enough info?  I'd appreciate some pointers on what I could do to
prompt
> some discussion because I have some promising new ideas that lead on
> from what I've described below.
> 
> Cheers,
> Ian
> 
> > -Original Message-----
> > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf
> Of
> > Ian Bolton
> > Sent: 18 December 2009 15:34
> > To: gcc@gcc.gnu.org
> > Subject: Possible IRA improvements for irregular register
> architectures
> >
> > Let's assume I have two sub-classes of ALL_REGS: BOTTOM_REGS
(c0-c15)
> > and TOP_CREGS (c16-c31).
> >
> > Let's also assume I have two main types of instruction: A-type
> > Instructions, which can use ALL 32 registers, and B-type
> Instructions,
> > which can only use the 16 BOTTOM_REGS.
> >
> > IRA will correctly calculate costs (in ira-costs.c) for allocnos
> > appearing in B-type instructions, such that TOP_CREGS has a higher
> > cost than BOTTOM_REGS.  It will also calculate costs for the A-type
> > instructions such that TOP_CREGS and BOTTOM_REGS have the same cost.
> >
> > The order of coloring will be determined by the algorithm chosen:
> > Priority or Chaitin-Briggs.  As each allocno is colored, the costs
> > will be inspected and the best available hard register will be
> chosen,
> > mainly based on the register class costs mentioned above, so
allocnos
> > in B-type Instructions will usually be assigned a BOTTOM_REG if one
> is
> > free.  If two or more hard registers share the same cost then
> > whichever one appears first in the REG_ALLOC_ORDER will be assigned.
> > (If no hard register can be found, the allocno is assigned memory
and
> > will require a "reload" in a later pass to get a hard register.)
> >
> > I do not wish to alter the coloring algorithms or the coloring
order.
> > I believe they are good at determining the order to color allocnos,
> > which dictates the likelihood of being assigned a hard register.
> What
> > I wish to change is the hard register that is assigned, given that
> the
> > coloring order has determined that this allocno should get one next.
> >
> > Why do I need to do this?  Since the BOTTOM_REGS can be used by all
> > instructions, it makes sense to put them first in the
> REG_ALLOC_ORDER,
> > so we minimise the number of registers consumed by a low-pressure
> > function.  But it also makes sense, in high-pressure functions, to
> > steer A-type Instructions away from using BOTTOM_REGS so that they
> are
> > free for B-type Instructions to use.
> >
> > To achieve this, I tried altering the costs calculated in ira-
> costs.c,
> > either explicitly with various hacks or by altering operand
> > constraints.  The problem with this approach was that it is static
> and
> > independent, occurring before any coloring ord

Possible IRA bug in assign_hard_reg

2010-01-21 Thread Ian Bolton
Near the end of assign_hard_reg in ira-color.c, there is this code:


if (min_full_cost > mem_cost)
  {
    if (! retry_p && internal_flag_ira_verbose > 3 && ira_dump_file != NULL)
      fprintf (ira_dump_file, "(memory is more profitable %d vs %d) ",
               mem_cost, min_full_cost);
    best_hard_regno = -1;
  }


If retry_p is true then we are in reload, so I wouldn't expect us to
override best_hard_regno in this case.  I think the code should read:


if (min_full_cost > mem_cost && ! retry_p)
  {
    if (internal_flag_ira_verbose > 3 && ira_dump_file != NULL)
      fprintf (ira_dump_file, "(memory is more profitable %d vs %d) ",
               mem_cost, min_full_cost);
    best_hard_regno = -1;
  }


I'm probably wrong, but I wanted to check.

Cheers,
Ian


RE: Possible IRA bug in assign_hard_reg

2010-01-27 Thread Ian Bolton
Thanks for the detailed answer.

While we're on the subject of assign_hard_reg, I notice the costs and
min_cost variable are set but never used (decisions are being made with
the full_costs array and min_full_cost).  Should they be referenced
somehow or are they just redundant?

Cheers,
Ian

> -Original Message-
> From: Vladimir Makarov [mailto:vmaka...@redhat.com]
> Sent: 21 January 2010 21:08
> To: Ian Bolton
> Cc: gcc@gcc.gnu.org
> Subject: Re: Possible IRA bug in assign_hard_reg
> 
> Ian Bolton wrote:
> > Near the end of assign_hard_reg in ira-color.c, there is this code:
> >
> >
> > if (min_full_cost > mem_cost)
> >   {
> >     if (! retry_p && internal_flag_ira_verbose > 3
> >         && ira_dump_file != NULL)
> >       fprintf (ira_dump_file, "(memory is more profitable %d vs %d) ",
> >                mem_cost, min_full_cost);
> >     best_hard_regno = -1;
> >   }
> >
> >
> > If retry_p is true then we are in reload, so I wouldn't expect us to
> > override best_hard_regno in this case.  I think the code should
read:
> >
> >
> > if (min_full_cost > mem_cost && ! retry_p)
> >   {
> >     if (internal_flag_ira_verbose > 3 && ira_dump_file != NULL)
> >       fprintf (ira_dump_file, "(memory is more profitable %d vs %d) ",
> >                mem_cost, min_full_cost);
> >     best_hard_regno = -1;
> >   }
> >
> >
> > I'm probably wrong, but I wanted to check.
> >
> >
> The original code is ok, I think.
> 
> First, retry_p is true not only from reload.  It is also true when we
> are trying to assign a hard register after IR flattening, during which
> new allocnos can be created to resolve cycles for register shuffling
> on region borders.
> 
> If memory is more profitable, we should use it, independently of where
> we call assign_hard_reg.
> 
> Retry_p is used as a guard for the dump printing because it is part of
> the dump when an allocno is popped from the coloring stack (in this
> case retry_p is false).



Question about IRA (setup_allocno_priorities)

2010-01-28 Thread Ian Bolton
Hi again,

Thanks for your answer to my other question.

I just found a case where an allocno wasn't getting a register when
I thought it should, since it was referenced 24 times.
I looked in setup_allocno_priorities() to see how this was used
to determine the priority and found this line:

mult = floor_log2 (ALLOCNO_NREFS (a)) + 1;

All other things being equal, this means something that is
referenced 16 times will get the same priority as something
referenced 31 times.  I just changed this line to remove the
floor_log2 and got better results on my test case.

I'm wondering what the origins of the floor_log2 are?  You
scale everything based on max_priority and INT_MAX later,
so I don't see the harm in using the nrefs as the multiplier
in the first instance, and I actually think this would be
more representative of the impact of an allocno not
getting a register.

NOTE: I am still using the Priority algorithm because that
performs better than CB on our architecture at the moment.

Cheers,
Ian


RE: GCC freezing for a Multiply Chain

2010-02-15 Thread Ian Bolton
Hi Balaji.

If you have access to the GCC source and are willing to change it,
there is a way to turn off IRA (the Integrated Register Allocator) and
just use a "fast allocation" algorithm that otherwise only gets used at
-O0.

Should IRA be the issue, this may be a workaround for you in the
short-term:

In GCC version 4.4.1, the last function of ira-color.c is called
ira_color().  Simply make it always call fast_allocation instead of
doing so only when ira_conflicts_p is false.

Best regards,
Ian

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf
Of
> balaji.i...@gtri.gatech.edu
> Sent: 15 February 2010 17:04
> To: gcc@gcc.gnu.org
> Subject: GCC freezing for a Multiply Chain
> 
> Hello Everyone,
>   I am creating a benchmark, where I have a following code:
> 
> x1 = a * b
> x2 = x1 * a;
> x3 = x1* x2;
> x4 = x2 * x3;
> x5 = x3 * x4;
> x6 = x5 * x6;
>   . . .
>   . . .
>x1000 = x999 * x998;
> 
> 
> 
>  When I do this, and compile using -O3/-O2/-O1, the compiler freezes.
> Am I doing anything wrong?
> 
> GCC Version: 4.2.1
> Platform: x86_64 (in macOS Snow Leopard).
> 
> I also tested this on a i386 (Red Hat 4.3.2-7) machine with gcc 4.3.2
> and I had the same response.
> 
> Please CC me in the response since I m not a subscribed member of this
> mailing list.
> 
> Any help is highly appreciated!
> 
> Thanks,
> 
> 
> Balaji V. Iyer.


Function versioning tests?

2010-02-19 Thread Ian Bolton
Hi there,

I've changed our private port of GCC to give versioned functions
better names (rather than T.0, T.1), and was wondering if there
are any existing tests that push function-versioning to the limit,
so I can test whether my naming scheme is sound.

Failing that, I'd appreciate some pointers on how I might make
such a test.  I know I need to be passing a constant in as a
parameter, but I don't know what other criteria are required to
make it trigger.

Cheers,
Ian


RE: Function versioning tests?

2010-02-22 Thread Ian Bolton
Hi Bingfeng.

Thanks for pointing me at that patch, but it doesn't do what I require,
which is probably fortunate because I would have wasted work otherwise!

My change incorporates the callee function name and caller function
name into the new name (e.g. bar.0.foo), so that we can trace that
name back to the original non-versioned function in the source, without
requiring additional debugging information; a standard build contains
all the info we need because it is held in the function names.

Are there any versioning tests for the patch you mention?

Cheers,
Ian

> -Original Message-
> From: Bingfeng Mei [mailto:b...@broadcom.com]
> Sent: 22 February 2010 09:58
> To: Ian Bolton; gcc@gcc.gnu.org
> Subject: RE: Function versioning tests?
> 
> Hi,
> GCC 4.5 already contains such patch.  http://gcc.gnu.org/ml/gcc-
> patches/2009-03/msg01186.html
> If you are working on 4.4 branch, you can just apply the patch without
> problem.
> 
> Cheers,
> Bingfeng
> 
> > -Original Message-
> > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On
> > Behalf Of Ian Bolton
> > Sent: 19 February 2010 17:09
> > To: gcc@gcc.gnu.org
> > Subject: Function versioning tests?
> >
> > Hi there,
> >
> > I've changed our private port of GCC to give versioned functions
> > better names (rather than T.0, T.1), and was wondering if there
> > are any existing tests that push function-versioning to the limit,
> > so I can test whether my naming scheme is sound.
> >
> > Failing that, I'd appreciate some pointers on how I might make
> > such a test.  I know I need to be passing a constant in as a
> > parameter, but I don't know what other criteria are required to
> > make it trigger.
> >
> > Cheers,
> > Ian
> >
> >


RE: Coloring problem - Pass 0 for finding allocno costs

2010-03-18 Thread Ian Bolton
> The problem I see is that for registers 100,101 I get best register
> class D instead of R - actually they get the same cost and D is chosen
> (maybe because it is first).

Hi Frank.

Do D and R overlap?  It would be useful to know which regs are in
which class, before trying to understand what is going on.

Can you paste an example of your define_insn from your MD file to show
how operands from D or R are both valid?  I ask this because it is
possible to express that D is more expensive than R with operand
constraints.

For general IRA info, you might like to look over my long thread on
here called "Understanding IRA".

Cheers,
Ian


RE: Coloring problem - Pass 0 for finding allocno costs

2010-03-18 Thread Ian Bolton
> -Original Message-
> From: Frank Isamov [mailto:frank.isa...@gmail.com]
> Sent: 18 March 2010 14:29
> To: Ian Bolton
> Subject: Re: Coloring problem - Pass 0 for finding allocno costs
> 
> On Thu, Mar 18, 2010 at 3:51 PM, Ian Bolton 
> wrote:
> >> The problem I see is that for registers 100,101 I get best register
> >> class D instead of R - actually they get the same cost and D is
> chosen
> >> (maybe because it is first).
> >
> > Hi Frank.
> >
> > Do D and R overlap?  It would be useful to know which regs are in
> > which class, before trying to understand what is going on.
> >
> > Can you paste an example of your define_insn from your MD file to
> show
> > how operands from D or R are both valid?  I ask this because it is
> > possible to express that D is more expensive than R with operand
> > constraints.
> >
> > For general IRA info, you might like to look over my long thread on
> > here called "Understanding IRA".
> >
> > Cheers,
> > Ian
> >
> 
> Hi Ian,
> 
> Thank you very much for your prompt reply.
> D and R do not overlap. Please see fragments of .md and .h files
> below:
> 
> From the md:
> 
> (define_register_constraint "a" "R_REGS" "")
> (define_register_constraint "d" "D_REGS" "")
> 
> (define_predicate "a_operand"
> (match_operand 0 "register_operand")
> {
> unsigned int regno;
> if (GET_CODE (op) == SUBREG)
> op = SUBREG_REG (op);
> regno = REGNO (op);
> return (regno >= FIRST_PSEUDO_REGISTER ||
> REGNO_REG_CLASS(regno) == R_REGS);
> }
> )
> 
> (define_predicate "d_operand"
> (match_operand 0 "register_operand")
> {
> unsigned int regno;
> if (GET_CODE (op) == SUBREG)
> op = SUBREG_REG (op);
> regno = REGNO (op);
> return (regno >= FIRST_PSEUDO_REGISTER ||
> REGNO_REG_CLASS(regno) == D_REGS);
> }
> )
> 
> (define_predicate "a_d_operand"
>   (ior (match_operand 0 "a_operand")
>(match_operand 0 "d_operand")))
> 
> (define_predicate "i_a_d_operand"
>   (ior (match_operand 0 "immediate_operand")
>(match_operand 0 "a_d_operand")))
> 
> (define_insn "mov_regs"
>   [(set (match_operand:SISFM 0 "a_d_operand" "=a, a, a, d, d, d")
> (match_operand:SISFM 1 "i_a_d_operand"   "a, i, d, a,
> i, d"))]
>   ""
>   "move\t%1, %0"
> )
> 
> (define_insn "*addsi3_1"
>   [(set (match_operand:SI 0 "a_d_operand""=a, a,  a,d,d")
> (plus:SI (match_operand:SI 1 "a_d_operand" "%a, a,
a,d,d")
> (match_operand:SI 2 "nonmemory_operand"
> "U05,S16,a,d,U05")))]
>   ""
>   "adda\t%2, %1, %0"
> )
> 
> ;; Arithmetic Left and Right Shift Instructions
> (define_insn "3"
>   [(set (match_operand:SCIM 0 "register_operand" "=a,d,d")
>         (sh_oprnd:SCIM (match_operand:SCIM 1 "register_operand" "a,d,d")
>                        (match_operand:SI 2 "nonmemory_operand" "U05,d,U05")))
>    (clobber (reg:CC CC_REGNUM))]
>   ""
>   "\t%2, %1, %0"
> )
> 
> From the h file:
> 
> #define REG_CLASS_CONTENTS
> \
>   {
>  \
> {0x, 0x, 0x}, /* NO_REGS*/  \
> {0x, 0x, 0x}, /* D_REGS*/  \
> {0x, 0x, 0x}, /* R_REGS*/   \
> 
> ABI requires use of R registers for arguments and return value.  Other
> than that, all of these instructions are more or less symmetrical in
> the sense of using D or R.  So, an optimal choice would be use of R for
> this example; if a D register is chosen, it involves an additional copy
> from R to D and back to R.
> 
> Thank you, Frank


Hi Frank,

Have you defined REG_ALLOC_ORDER?  All other things being equal,
IRA will choose the register that comes first in the REG_ALLOC_ORDER.

In terms of operand constraints, if you replace 'd' with '?d' in
each of your operand constraints, it will show IRA that using D_REGS
is slightly more expensive than using R_REGS.  This may not always
be true, however, if the value being put in the D_REG never needs to
be returned, so this might cause IRA to use up all the R_REGS first
with things that need not be there.  I didn't like using constraints
because of this issue - the extra cost sometimes incurred is
conditional, and you can't represent that in the constraints.

Have you defined REGISTER_MOVE_COST?  In defaults.h, it is set to
2, which should be what you are looking for.  If you can get IRA to
see that the best COVER_CLASS is R_REGS then, when D_REGS is given,
there will be an added cost of 2 for that option.
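
For reference, a minimal sketch of the kind of definition meant here.
The real macro takes the mode and the two register classes, and the
cost of 4 for a cross-class move is an assumption for illustration, not
a recommendation for Frank's port:

```c
/* Make a D_REGS <-> R_REGS move cost more than a same-class move, so
   IRA sees cross-class shuffles as worth avoiding.  FROM and TO are
   register classes; 2 matches the defaults.h baseline.  */
#define REGISTER_MOVE_COST(MODE, FROM, TO) \
  ((FROM) == (TO) ? 2 : 4)
```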

One other thing: Don't overlook Pass 1.  That's where the real costs
are calculated.

I hope some of this helps.  If not, you'll have to send me your
test program and ira dump file.

Best regards,
Ian


RE: Coloring problem - Pass 0 for finding allocno costs

2010-03-18 Thread Ian Bolton
> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> Michael Matz
> Sent: 18 March 2010 15:13
> To: Frank Isamov
> Cc: gcc@gcc.gnu.org
> Subject: Re: Coloring problem - Pass 0 for finding allocno costs
> 
> Hi,
> 
> On Thu, 18 Mar 2010, Frank Isamov wrote:
> 
> > From the h file:
> >
> > #define REG_CLASS_CONTENTS \
> >   { \
> >     {0x, 0x, 0x}, /* NO_REGS */ \
> >     {0x, 0x, 0x}, /* D_REGS  */ \
> >     {0x, 0x, 0x}, /* R_REGS  */ \
> >
> > ABI requires use of R registers for arguments and return value. Other
> > than that all of these instructions are more or less symmetrical in
> > sense of using D or R.
> 
> If that is so, you're better off not using two different register
> classes
> at all.  Register classes are there to describe asymmetry in the ISA
> register file.  For describing ABI constraints like argument or
> caller-save registers look at FUNCTION_ARG, FUNCTION_ARG_REGNO_P,
> FUNCTION_VALUE_REGNO_P and CALL_USED_REGISTERS (and some friends).
> 
> The compiler will e.g. try to make sure to first use call-clobbered
> registers in leaf functions, so that they don't need to be
> saved/restored
> in the prologue/epilogue.  You don't need register classes to describe
> this.

Good spot!  But ...

In the original post, Frank mentioned that all reg classes in an
insn must match ("all registers in an insn should be from the same
register class"), so I don't think a single class would suffice because
there would be no way to enforce the above requirement.

However, I think a superclass that contains both would work, and the
CALL_USED_REGISTERS define can be used to show just the R_REGS.
Meanwhile, the matching class requirement can still be enforced via the
operand constraints.
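
A hedged sketch of the superclass idea, with illustrative class names
and bit masks that are assumptions rather than values from Frank's real
port: keep D_REGS and R_REGS for the constraints, add an A_D_REGS
superclass covering both, and mark only the R registers as call-used.

```c
/* Illustrative 32-register layout: D_REGS in the low 16 bits, R_REGS
   in the high 16, A_D_REGS as their union.  */
enum reg_class
{
  NO_REGS, D_REGS, R_REGS, A_D_REGS, ALL_REGS, LIM_REG_CLASSES
};

#define REG_CLASS_CONTENTS                  \
  { {0x00000000}, /* NO_REGS  */            \
    {0x0000ffff}, /* D_REGS   */            \
    {0xffff0000}, /* R_REGS   */            \
    {0xffffffff}, /* A_D_REGS */            \
    {0xffffffff} } /* ALL_REGS */

/* Only the R registers are clobbered by calls in this sketch, which
   expresses the ABI constraint without splitting the allocation class.  */
#define CALL_USED_REGISTERS                 \
  { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 }
```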

I might be wrong, however.  I am still learning about IRA too, so feel
free to teach me something new ;-)


Understanding Scheduling

2010-03-19 Thread Ian Bolton
Hi folks!

I've moved on from register allocation (see Understanding IRA thread)
and onto scheduling.

In particular, I am investigating the effectiveness of the sched1
pass on our architecture and the associated interblock-scheduling
optimisation.


Let's start with sched1 ...

For our architecture at least, it seems like Richard Earnshaw is
right that sched1 is generally bad when you are using -Os, because
it can increase register pressure and cause extra spill/fill code when
you move independent instructions in between dependent instructions.

For example:

LOAD c2,c1[0]
LOAD c3,c1[1]
ADD c2,c2,c3  # depends on LOAD above it (might stall)
LOAD c3,c1[2]
ADD c2,c2,c3  # depends on LOAD above it (might stall)
LOAD c3,c1[3]
ADD c2,c2,c3  # depends on LOAD above it (might stall)
LOAD c3,c1[4]
ADD c2,c2,c3  # depends on LOAD above it (might stall)

might become:

LOAD c2,c1[0]
LOAD c3,c1[1]
LOAD c4,c1[2] # independent of first two LOADS
LOAD c5,c1[3] # independent of first two LOADS
ADD c2,c2,c3  # not dependent on preceding two insns (avoids stall)
LOAD c3,c1[4]
ADD c2,c2,c4  # not dependent on preceding three insns (avoids stall)
...

This is a nice effect if your LOAD instructions have a latency of 3,
so this should lead to performance increases, and indeed this is
what I see for some low-reg-pressure Nullstone cases.  Turning
sched1 off therefore causes a regression on these cases.

However, this pipeline-type effect may increase your register
pressure such that caller-save regs are required and extra spill/fill
code needs to be generated.  This happens for some other Nullstone
cases, and so it is good to have sched1 turned off for them!

It's therefore looking like some kind of clever hybrid is required.

I mention all this because I was wondering which other architectures
have turned off sched1 for -Os?  More importantly, I was wondering
if anyone else had considered creating some kind of clever hybrid
that only uses sched1 when it will increase performance without
increasing register pressure?

Or perhaps I could make a heuristic based on the balanced-ness of the
tree?  (I see sched1 does a lot better if the tree is balanced, since
it has more options to play with.)


Now onto interblock-scheduling ...

As we all know, you can't have interblock-scheduling enabled unless
you use the sched1 pass, so if sched1 is off then interblock is
irrelevant.  For now, let's assume we are going to make some clever
hybrid that allows sched1 when we think it will increase performance
for Os and we are going to keep sched1 on for O2 and O3.

As I understand it, interblock-scheduling enlarges the scope of
sched1, such that you can insert independent insns from a
completely different block in between dependent insns in this
block.  As well as potentially amortizing stalls on high latency
insns, we also get the chance to do "meatier" work in the destination
block and leave less to do in the source block.  I don't know if this
is a deliberate effect of interblock-scheduling or if it is just
a happy side-effect.

Anyway, the reason I mention interblock-scheduling is that I see it
doing seemingly intelligent moves, but then the later BB-reorder pass
is juggling blocks around such that we end up with extra code inside
hot loops!  I assume this is because the scheduler and BB-reorderer
are largely ignorant of each other, and so good intentions on the
part of the former can be scuppered by the latter.

I was wondering if anyone else has witnessed this madness on their
architecture?  Maybe it is a bug with BB-reorder?  Or maybe it should
only be enabled when function profiling information (e.g. gcov) is
available?  Or maybe it is not a high-priority thing for anyone to
think about because no one uses interblock-scheduling?

If anyone can shed some light on the above, I'd greatly appreciate
it.  For now, I will continue my experiments with selective enabling
of sched1 for -Os.

Best regards,
Ian


RE: Understanding Scheduling

2010-03-19 Thread Ian Bolton
> > Let's start with sched1 ...
> >
> > For our architecture at least, it seems like Richard Earnshaw is
> > right that sched1 is generally bad when you are using -Os, because
> > it can increase register pressure and cause extra spill/fill code
> > when you move independent instructions in between dependent
> > instructions.
> 
> Please note that Vladimir Makarov implemented register pressure-aware
> sched1 for GCC 4.5, activated with -fsched-pressure.  I thought I
> should mention this because your e-mail omits it completely, so it's
> hard to tell whether you tested it.
> 

Hi Alexander,

We are on GCC 4.4 at the moment.  I don't see us moving up in the
near future, but I could apply the patch and see what it does for us.

Cheers,
Ian


RE: Understanding Scheduling

2010-03-19 Thread Ian Bolton
> > I mention all this because I was wondering which other architectures
> > have turned off sched1 for -Os?  More importantly, I was wondering
> > if anyone else had considered creating some kind of clever hybrid
> > that only uses sched1 when it will increase performance without
> > increasing register pressure?
> >
> >
> http://gcc.gnu.org/ml/gcc-patches/2009-09/msg3.html
> 
> Another problem is that sched1 for architectures with few registers
> can result in reload failure.  I tried to fix this in the patch mentioned
> above but I am not sure it is done for all targets and all possible
> programs.  The right solution for this would be implementing hard
> register spills in the reload.

I don't think we have so few registers that reload failure will occur,
so it might be worth me trying this.

> 
> The above-mentioned code does not work for RA based on priority
> coloring, because register pressure calculation for intersected or
> nested classes makes little sense.

Hmm.  Thanks for mentioning that.  As you might recall, we are using
priority coloring at the moment because it yielded better performance
than Chaitin-Briggs.  Well, the real reason CB was rejected was that
we were already using Priority, and so moving to CB would cause
sufficient disruption such that performance increases and decreases would be
inevitable.  I did get some gains, but I also got regressions that we
couldn't absorb at the time I did the work.

I might be revisiting CB soon though, as it did tend to yield smaller
code, which is becoming more important to us. 

> 
> If scheduling for the target is very important (as for itanium or
> in-order execution power6), I'd recommend to look at the selective
> scheduler.

I don't think scheduling is highly important for us, but I will take
a look at the selective scheduler.

> 
> > Or perhaps I could make a heuristic based on the balanced-ness of
> > the tree?  (I see sched1 does a lot better if the tree is balanced,
> > since it has more options to play with.)
> >
> >
> >
> The register pressure is already mostly minimized when sched1 starts
> to work.

I guess this is a factor in the unbalancedness of the tree.  The more
you balance it, the more likely it will get wider and require more
registers.  But the wider and more balanced, the more options for sched1
and the more chance of a performance win (assuming the increase in reg
pressure does not outweigh the scheduling performance win.)

> > Now onto interblock-scheduling ...
> >
> > As we all know, you can't have interblock-scheduling enabled unless
> > you use the sched1 pass, so if sched1 is off then interblock is
> > irrelevant.  For now, let's assume we are going to make some clever
> > hybrid that allows sched1 when we think it will increase performance
> > for Os and we are going to keep sched1 on for O2 and O3.
> >
> > As I understand it, interblock-scheduling enlarges the scope of
> > sched1, such that you can insert independent insns from a
> > completely different block in between dependent insns in this
> > block.  As well as potentially amortizing stalls on high latency
> > insns, we also get the chance to do "meatier" work in the
> > destination block and leave less to do in the source block.  I
> > don't know if this is a deliberate effect of interblock-scheduling
> > or if it is just a happy side-effect.
> >
> > Anyway, the reason I mention interblock-scheduling is that I see it
> > doing seemingly intelligent moves, but then the later BB-reorder
> > pass is juggling blocks around such that we end up with extra code inside
> > hot loops!  I assume this is because the scheduler and BB-reorderer
> > are largely ignorant of each other, and so good intentions on the
> > part of the former can be scuppered by the latter.
> >
> >
> That is right.  It would be nice if somebody solves the problem.

Hmm.  If we keep sched1 on, then maybe I will be the man to do it!

Best regards,
Ian


RE: Understanding Scheduling

2010-03-22 Thread Ian Bolton
> Enabling BB-reorder only if profile info is available, is not the
> right way to go. The compiler really doesn't place blocks in sane
> places without it -- and it shouldn't have to, either. For example if
> you split an edge at some point, the last thing you want to worry
> about, is where the new basic block is going to end up.
> 
> There are actually a few bugs in Bugzilla about BB-reorder, FWIW.

I've done a few searches in Bugzilla and am not sure if I have found
the BB reorder bugs you are referring to.

The ones I have found are:

16797: Opportunity to remove unnecessary load instructions
41396: missed space optimization related to basic block reorder
21002: RTL prologue and basic-block reordering pessimizes delay-slot
   filling.

(If you can recall any others, I'd appreciate hearing of them.)

Based on 41396, it looks like BB reorder is disabled for -Os.  But
you said in your post above that "the compiler really doesn't place
blocks in sane places without it", so does that mean that we could
probably increase performance for -Os if BB reorder was (improved)
and enabled for -Os?

Cheers,
Ian


BB reorder forced off for -Os

2010-03-23 Thread Ian Bolton
Is there any reason why BB reorder has been disabled
in bb-reorder.c for -Os, such that you can't even
turn it on with -freorder-blocks?

From what I've heard on this list in recent days,
BB reorder gives good performance wins such that
most people would still want it on even if it did
increase code size a little.

Cheers,
Ian


RE: BB reorder forced off for -Os

2010-03-30 Thread Ian Bolton
> We're not able to enable BB reordering with -Os.  The behaviour is
> hard-coded via this if statement in rest_of_handle_reorder_blocks():
> 
>   if ((flag_reorder_blocks || flag_reorder_blocks_and_partition)
>   /* Don't reorder blocks when optimizing for size because extra
>  jump insns may be created; also barrier may create extra
>  padding.
> 
>  More correctly we should have a block reordering mode that
>  tried to minimize the combined size of all the jumps.  This
>  would more or less automatically remove extra jumps, but
>  would also try to use more short jumps instead of long
>  jumps.  */
>   && optimize_function_for_speed_p (cfun))
> {
>   reorder_basic_blocks ();
> 
> If you comment out the "&& optimize_function_for_speed_p (cfun)" then
> BB reordering takes places as desired (although this isn't a solution
> obviously).
> 
> In a private message Ian indicated that this had a small impact for
> the ISA he's working with but a significant performance gain.  I
> tried the
> same thing with the ISA I work on (Ubicom32) and this change typically
> increased code sizes by between 0.1% and 0.3% but improved performance
> by anything from 0.8% to 3% so on balance this is definitely winning
> for most of our users (this for a couple of benchmarks, the Linux
> kernel, busybox and smbd).
> 

It should be noted that commenting out the conditional to do with
optimising for speed will make BB reordering come on for all functions,
even cold ones, so I think whatever gains have come from making this
hacky change could increase further if BB reordering is set to
only come on for hot functions when compiling with -Os.  (Certainly
the code size increases could be minimised, whilst hopefully retaining
the performance gains.)

Note that I am in no way suggesting this should be the default
behaviour for -Os, but that it should be switchable via the flags just
like other optimisations are.  Once it is switchable, turning it on
for -Os should not universally enable BB reordering for every function
(as opposed to the current universal disabling), but pick a sensible
half-way point, based on heat, so that you get the performance wins
with minimal code size increases on selected functions.

Cheers,
Ian


Making BB Reorder Smarter for -Os

2010-04-08 Thread Ian Bolton
Bugzilla 41004 calls for a more -Os-friendly algorithm for BB Reorder,
and I'm hoping I can meet this challenge.  But I need your help!

If you have any existing ideas or thoughts that could help me get
closer to a sensible heuristic sooner, then please post them to this
list.

In the mean time, here's my progress so far:

My first thought was to limit BB Reorder to hot functions, as
identified by gcov, so we could get maximum execution time benefit for
minimised code size impact.  Based on benchmarking I've done so far,
this looks to be a winner, but it only works when profile data is
available.

Without profile data, we face two main problems:

1) We cannot easily estimate how many times we will execute our more
efficient code-layout, so we can't do an accurate trade-off versus the
code size increase.

2) The traces that BB Reorder constructs will be based on static branch
prediction, rather than true dynamic flows of execution, so the new
layout may not be the best one in practice.

We can address #1 by tagging functions as hot (using attributes), but
that may not always be possible and it does not guarantee that we will
get minimal codesize increases, which is the main aim of this work.

I'm not sure how #2 can be addressed, so I'm planning to sidestep it
completely, since the problem isn't really the performance pay-off but
the codesize increase that usually comes with each new layout of a
function that BB Reorder makes.

My current plan is to characterise a function within find_traces()
(looking at things like the number of traces, edge probabilities and
frequencies, etc.) and only call connect_traces() to effect the
reordering change if these characteristics suggest that minimal code
disruption and/or maximum performance pay-off will result.

Thanks for reading and I look forward to your input!

Cheers,
Ian





Question about tree-switch-conversion.c

2010-08-10 Thread Ian Bolton
I am in the process of fixing PR44328
(http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44328)

The problem is that gen_inbound_check in tree-switch-conversion.c subtracts
info.range_min from info.index_expr, which can cause the MIN and MAX values
for info.index_expr to become invalid.

For example:


typedef enum {
  FIRST = 0,
  SECOND,
  THIRD,
  FOURTH
} ExampleEnum;

int dummy (const ExampleEnum e)
{
  int mode = 0;
  switch (e)
  {
case SECOND: mode = 20; break;
case THIRD: mode = 30; break;
case FOURTH: mode = 40; break;
  }
  return mode;
}


tree-switch-conversion would like to create a lookup table for this, so
that SECOND maps to entry 0, THIRD maps to entry 1 and FOURTH maps to
entry 2.  It achieves this by subtracting SECOND from index_expr.  The
problem is that after the subtraction, the type of the result can have a
value outside the range 0-3.

Later, when tree-vrp.c sees the inbound check as being <= 2 with a possible
range for the type as 0-3, it converts the <=2 into a != 3, which is
totally wrong.  If e==FIRST, then we can end up looking for entry 255 in
the lookup table!

I think the solution is to update the type of the result of the subtraction
to show that it is no longer in the range 0-3, but I have had trouble
implementing this.  The attached patch (based off 4.5 branch) shows my
current approach, but I ran into LTO issues:

lto1: internal compiler error: in lto_get_pickled_tree, at lto-streamer-in.c

I am guessing this is because the debug info for the type does not match
the new range I have set for it.

Is there a *right* way to update the range such that LTO doesn't get
unhappy?  (Maybe a cast with fold_convert_loc would be right?)


pr44328.gcc4.5.fix.patch
Description: Binary data


RE: Link error

2010-08-12 Thread Ian Bolton
Phung Nguyen wrote:
> I am trying to build cross compiler for xc16x. I built successfully
> binutils 2.18; gcc 4.0 and newlib 1.18. Everything is fine when
> compiling a simple C file without any library call. It is also fine
> when making a simple call to printf like printf("Hello world").
> However, i got error message from linker when call printf("i=%i",i);

I don't know the answer, but I think you are more likely to get one
if you post to gcc-h...@gcc.gnu.org.  The gcc@gcc.gnu.org list is for
people developing gcc, rather than only building or using it.

I hope you find your answer soon.

Best regards,
Ian





Shouldn't unsafe-math-optimizations (re-)enable fp-contract=fast?

2014-03-06 Thread Ian Bolton
Hi there,

I see in common.opt that fp-contract=fast is the default for GCC.

But then it gets disabled in c-family/c-opts.c if you are using ISO C
(e.g. with -std=c99).

But surely if you have also specified -funsafe-math-optimizations then
it should flip it back onto fast?

I see in gcc/opts.c that a few other flags are triggered based on
-funsafe-math-optimizations but not -ffp-contract=fast.

I think this is a bug and could lead to less optimal code than people
would like if they want c99 compliance for reason A but also unsafe
math for reason B.

What do you think?

Cheers,
Ian





RE: Shouldn't unsafe-math-optimizations (re-)enable fp-contract=fast?

2014-03-07 Thread Ian Bolton


> -Original Message-
> From: Joseph S. Myers
> On Thu, 6 Mar 2014, Ian Bolton wrote:
>
> > I see in common.opt that fp-contract=fast is the default for GCC.
> >
> > But then it gets disabled in c-family/c-opts.c if you are using ISO C
> > (e.g. with -std=c99).
> >
> > But surely if you have also specified -funsafe-math-optimizations
> then
> > it should flip it back onto fast?
> 
> That seems reasonable.
> 

Thanks for the feedback, Joseph.

I've been working on the patch, but I see in gcc/opts.c that there is
a definite distinction between set_fast_math_flags and
set_unsafe_math_optimizations_flags.  I'm thinking this is more of a
fast-math thing than an unsafe_math_optimizations thing, so I should
actually be adding it in set_fast_math_flags.

Is this correct?


Whilst I'm on, I think I found a bug in the Optimize-Options 
documentation ...

It says that -ffast-math is enabled by -Ofast and that -ffast-math
enables -funsafe-math-optimizations.  But the definition of
-funsafe-math-optimizations says it is not turned on by any -O option.
I think it should say that it is turned on by -Ofast (via -ffast-math),
just for clarity and consistency.

Does that make sense?

Many thanks,
Ian




> --
> Joseph S. Myers
> jos...@codesourcery.com






RE: Shouldn't unsafe-math-optimizations (re-)enable fp-contract=fast?

2014-03-07 Thread Ian Bolton
> Thanks for the feedback, Joseph.
> 
> I've been working on the patch, but I see in gcc/opts.c that there is
> a definite distinction between set_fast_math_flags and
> set_unsafe_math_optimizations_flags.  I'm thinking this is more of a
> fast-math thing than an unsafe_math_optimizations thing, so I should
> actually be adding it in set_fast_math_flags.
> 
> Is this correct?

Scratch that. I went ahead and tried to set it in opts.c, but the c-opts.c
override (to OFF) will currently happen in every case UNLESS you explicitly
asked for -ffp-contract=fast.

I therefore think the better solution is now to not override it in
c-opts.c if unsafe-math-optimizations (which itself is enabled by
-ffast-math and/or -Ofast) is enabled.  We already have fp-contract=fast
on by default, so we should just stop it being turned off unnecessarily
instead.

I've successfully coded this one and it works as expected when compiled
with "-Ofast -std=c99".  I'll prepare the patch now.

Cheers,
Ian