Really nice analysis!  Thanks for writing this up.

Sorry for the big quote below, but:

Jan Hubicka <hubi...@ucw.cz> writes:
>> > +/* Implement TARGET_IRA_CALLEE_SAVED_REGISTER_COST_SCALE.  */
>> > + 
>> > +static int
>> > +ix86_ira_callee_saved_register_cost_scale (int)
>> > +{
>> > +  return 1;
>> > +}
>> > +
>
>> >     return cl;
>> >   }
>> > +int
>> > +default_ira_callee_saved_register_cost_scale (int)
>> > +{
>> > +  return (optimize_size
>> > +    ? 1
>> > +    : REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
>> > +}
>> > +
>
> I am not sure how this makes sense - why x86 would be significantly
> different from other targets?
>
> I think the only bit non-standard thing is that prologue/epilogue code
> can use push/pop that is shorter than mov used to save caller saved
> registers.
>
> I went through few testcases:
>
> void d();
> void a()
> {       
>         int b;
>         asm ("use %0":"=r" (b));
>         d();
>         asm volatile (""::"r" (b));
> }
> compiler with -O2 -fira-verbose=10000 gives:
>
> Popping a0(r99,l0)  -- (0=12000,12000) (1=12000,12000) (2=12000,12000) 
> (4=12000,12000) (5=12000,12000) (36=12000,12000) (37=12000,12000) 
> (38=12000,12000) (39=12000,12000) (3=11000,11000) (6=11000,11000) 
> (40=11000,11000) (41=11000,11000) (42=11000,11000) (43=11000,11000)
>
> load and save costs are 6. So spill pair is 12 weighted by 1000 that is
> REG_FREQ_MAX.
>
> Register 0 (EAX) has cost 12000 which makes sense to me:
>   - load and save costs are 6, combined spill pair is 12
>   - REG_FREQ_MAX is 1000 and since function has only one BB, it has
>     maximal frequency, so we get 12000.
>
> Register 3 (first caller saved) has cost 11000.  This comes from:
>             add_cost = ((ira_memory_move_cost[mode][rclass][0]
>                          + ira_memory_move_cost[mode][rclass][1])
>                         * saved_nregs / hard_regno_nregs (hard_regno,
>                                                           mode) - 1)
>                                                                 ^^
>                                                                 here
>
>                        * (optimize_size ? 1 :
>                           REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
>
> There is no comment why -1, but I suppose it is there to biass costs to
> use prologue/epilogue instad of caller save sequence when runtime cost
> estimate is even.
>
> Now for
> void d();
> void a()
> {
>         for (int i = 0; i < 100; i++)
>           d();
>         int b;
>         asm ("use %0":"=r" (b));
>         d();
>         asm volatile (""::"r" (b));
> }
>
> I get
>
>       Popping a0(r100,l0)  -- (0=120,120) (1=120,120) (2=120,120) (4=120,120) 
> (5=120,120) (36=120,120) (37=120,120) (38=120,120) (39=120,120) (3=0,0) 
> (6=110,110) (40=110,110) (41=110,110) (42=110,110) (43=110,110)
>
> This also makes sense to me, since there is loop the basic block has
> lower frequency of 10, thus costs are scaled down.
>
> void d();
> int cnd;
> void a()
> {
>         int b;
>         asm ("use %0":"=r" (b));
>
>         if (__builtin_expect_with_probability (cnd, 1, 0.8))
>           d();
>         asm volatile (""::"r" (b));
> }
>
> I get
>
>      Popping a0(r100,l0)  -- (0=9600,9600) (1=9600,9600) (2=9600,9600) 
> (4=9600,9600) (5=9600,9600) (36=9600,9600) (37=9600,9600) (38=9600,9600) 
> (39=9600,9600) (3=11000,11000) (6=11000,11000) (40=11000,11000) 
> (41=11000,11000) (42=11000,11000) (43=11000,11000)
>
> which seems also correct.  It is better to use caller saved registr
> since call to d() has lower frequency then the entry basic block. This
> is what gcc 13 and this patch gets wrong
>
>      Popping a0(r100,l0)  -- (1=9600,9600) (2=9600,9600) (4=9600,9600) 
> (5=9600,9600) (36=9600,9600) (37=9600,9600) (38=9600,9600) (39=9600,9600) 
> (3=11,11) (6=11,11) (40=11,11) (41=11,11) (42=11,11) (43=11,11)
>
> Due to missing scaling factor we think that using callee saved registr
> is win while it is not.  GCC13 gets this wrong even for probability 0.
>
> Looking into PRs referneced in the patch:
> PR111673 is the original bug that motivated correcting the cost (adding
>          the scale by entry block frequency)
> PR115932 is cris-elf I don't know how to bencmark easily.
> PR116028 seems to be about shrink wrapping in
>
> void f(int *i)
> {
>         if (!i)
>                 return;
>         else
>         {
>                 __builtin_printf("Hi");
>                 *i=0;
>         }
> }
>
> here I see tha tthe cost model misses the fact that epilogue will be
> shrink-wrapped so both caller and callee saving will result in one spill
> after the early exit.
>
> PR117081 is about regression in povray. The reducted testcase:
>
> void foo (void);
> void bar (void);
>
> int
> test (int a)
> {
>   int r;
>
>   if (r = -a)
>     foo ();
>   else
>     bar ();
>
>   return r;
> }
>
> shows that we now use caller saved register (EAX) to hold the return value 
> which yields longer code.  The costs are
> Popping a0(r98,l0)  -- (0=13000,13000) (3=15000,15000) (6=15000,15000) 
> (40=15000,15000) (41=15000,15000) (42=15000,15000) (43=15000,15000)
>
> here 15000 is 11000+4000 where I think 4000 is cost of 2 reg-reg moves
> multiplied by REG_FREQ_MAX.   This seems correct. GCC 13 uses callee
> saved register and produces:
>
> 0000000000000000 <test>:
>    0: 53                      push   %rbx             <--- callee save
>    1: 89 fb                   mov    %edi,%ebx        <--- move 1
>    3: f7 db                   neg    %ebx
>    5: 74 09                   je     10 <test+0x10>
>    7: e8 00 00 00 00          call   c <test+0xc>
>    c: 89 d8                   mov    %ebx,%eax        <--- callee restore
>    e: 5b                      pop    %rbx
>    f: c3                      ret
>   10: e8 00 00 00 00          call   15 <test+0x15>
>   15: 89 d8                   mov    %ebx,%eax        <--- move 2
>   17: 5b                      pop    %rbx             <--- callee restore
>   18: c3                      ret
>
> Mainline used EAX since it has costs 13000.  It is not 100% clear to me
> why.
>  - 12000 is the spilling (which is emitted twice but executed just once)
>  - I would have expected 2000 for the move from edi to eax.
> However even if cost is 14000 we will choose EAX.  The code is:
>
>    0: 89 f8                   mov    %edi,%eax        <--- move1
>    2: 48 83 ec 18             sub    $0x18,%rsp       <--- stack frame 
> creation
>    6: f7 d8                   neg    %eax
>    8: 89 44 24 0c             mov    %eax,0xc(%rsp)   <--- spill out
>    c: 85 ff                   test   %edi,%edi
>    e: 74 10                   je     20 <test+0x20>
>   10: e8 00 00 00 00          call   15 <test+0x15>
>   15: 8b 44 24 0c             mov    0xc(%rsp),%eax   <--- spill in
>   19: 48 83 c4 18             add    $0x18,%rsp       <--- stack frame
>   1d: c3                      ret
>   1e: 66 90                   xchg   %ax,%ax
>   20: e8 00 00 00 00          call   25 <test+0x25>
>   25: 8b 44 24 0c             mov    0xc(%rsp),%eax   <--- spill in
>   29: 48 83 c4 18             add    $0x18,%rsp       <--- stack frame
>   2d: c3                      ret
>
> This sequence really saves one move at the expense of of stack frame
> allocation (which is not modelled by the cost model) and longer spill
> code (also no modelled).

Kind of a tangent, but is a cost of 2 reasonable for these reg<-reg moves,
or is it too high?  Reducing moves seems like the wrong thing to optimise
for (when optimising for speed) if the moves disappear during renaming anyway.

> PR117082 is about noreturn function:
> __attribute__ ((noreturn))
> void
> f3 (void)
> {
>   int y0 = x0;
>   int y1 = x1;
>   f1 ();
>   f2 (y0, y1);
>   while (1);
> }
>
> Here the cost model is really wrong by assuming that entry and exit
> block have same frequencies.  This can be fixed quite easilly (though it
> is a rare case)

Yeah (and nice example).

> PR118497  seems to be ixed.
>
> So overall I think
>  1) we can fix scaling of epilogue by exit block frequency
>     to get noreturns right.
>  2) we should drop the check for optimize_size.  Since with -Os
>     REG_FREQ_FROM_BB always returns 1000 everything should be scaled
>     same way
>  3) we currently have wire in "-1" to biass the cost metric for callee
>     saved registers.
>     It may make sense to allow targets to control this, since i.e. x86
>     has push/pop that is shorter. -3 would solve the testcase with neg
>     and would express that push/pop is still cheaper with extra reg-reg
>     move.
>  4) cost model misses shring wrapping, the fact that if register is
>     callee saved it may be used by multiple allocnos and also that
>     push/pop sequence may avoid need for manual RSP adjustments.
>
>     Those seems bit harder things to fit in though.
>
> So if we want to go with the target hook, I think it should adjust the
> cost before scalling (since targets may have special tricks for
> prologues) rather than the scale factor (which is target independent
> part of cost model).

Like you say, one of the missing pieces appears to be the allocation/
dealloaction overhead for callee saves.  Could we try to add a hook to
model that cost, based on certain parameters?

In particular, one thing that the examples above have in common is that
they don't need to allocate a frame for local variables.  That seems
like it ought to be part of the mix.  If we need to allocate a frame
using addition anyway, then presumably one of the advantages of push/pop
over callee saves goes away.

But going back to my question above, modelling the allocation and
deallocation would need to be done in a way that makes them more
expensive than moves.  I think in some cases we've assumed that
addition and moves have equal cost, even when optimising for speed.

In other words, rather than add a hook to tweak the -1 bias (which I
still agree is a more reasonable thing to do than bypassing the IRA
code altogether), could we add a separate hook for the allocation/
deallocation and leave IRA to work out when to apply it?

I suppose for -fno-omit-frame-pointer we should also take the creation
of the frame link into account, if we don't already.  This matters on
aarch64 since -fomit-frame-pointer is not the default at -O2.

One quirk on aarch64 is that, if we're using an odd number of
callee-saved GPRs, adding one more is essentially free, since we would
allocate space for it anyway, and could save and restore it alongside
the odd one out.  That would probably be difficult to apply in practice
though.  And it's obviously a separate issue from the current one.
Just mentioning it as another thing we could model in future.

Thanks,
Richard

Reply via email to