> >
> >    0:       89 f8                   mov    %edi,%eax        <--- move1
> >    2:       48 83 ec 18             sub    $0x18,%rsp       <--- stack frame creation
> >    6:       f7 d8                   neg    %eax
> >    8:       89 44 24 0c             mov    %eax,0xc(%rsp)   <--- spill out
> >    c:       85 ff                   test   %edi,%edi
> >    e:       74 10                   je     20 <test+0x20>
> >   10:       e8 00 00 00 00          call   15 <test+0x15>
> >   15:       8b 44 24 0c             mov    0xc(%rsp),%eax   <--- spill in
> >   19:       48 83 c4 18             add    $0x18,%rsp       <--- stack frame
> >   1d:       c3                      ret
> >   1e:       66 90                   xchg   %ax,%ax
> >   20:       e8 00 00 00 00          call   25 <test+0x25>
> >   25:       8b 44 24 0c             mov    0xc(%rsp),%eax   <--- spill in
> >   29:       48 83 c4 18             add    $0x18,%rsp       <--- stack frame
> >   2d:       c3                      ret
> >
> > This sequence really saves one move at the expense of stack frame
> > allocation (which is not modelled by the cost model) and longer spill
> > code (also not modelled).
> 
> Kind of a tangent, but is a cost of 2 reasonable for these reg<-reg moves,
> or is it too high?  Reducing moves seems like the wrong thing to optimise
> for (when optimising for speed) if the moves disappear during renaming anyway.

The costs are scaled so that 2 is the cost of a usual reg-reg move.  This
definitely predates me, but it is how reload worked.  It used cost 2 for
a move, and 1 was used to add various biases (such as ! in constraints).
It used a scale of 3^depth to avoid spilling in deeper loop nests, and I
have introduced the frequency metric instead.
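
As a rough worked example of the difference: with the old scheme a spill
in a doubly nested loop was weighted 3^2 = 9 times as much as one in
straight-line code, regardless of how often the loop actually iterates,
while the frequency metric weights it by the (estimated or profile-based)
execution frequency of the code containing the access.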

Load/store costs are thus relative to the reg-reg move cost divided by two.
For generic we set the cost of a load/store to 6, which matches the usual
latency of the operation (3 cycles).  On modern x86 architectures this is
all relative, since a reg-reg move has a good chance of having a latency
of 1, and store+load pairs may execute faster, too.
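
So, as a rough summary of the units involved (generic tuning):

  reg-reg move            2    (the baseline)
  load or store           6    (~3 cycles each)
  spill = store + load   12    (i.e. worth about 6 reg-reg moves)

all of it then weighted by the frequency of the point where the access
happens.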

> 
> Like you say, one of the missing pieces appears to be the allocation/
> deallocation overhead for callee saves.  Could we try to add a hook to
> model that cost, based on certain parameters?
> 
> In particular, one thing that the examples above have in common is that
> they don't need to allocate a frame for local variables.  That seems
> like it ought to be part of the mix.  If we need to allocate a frame
> using addition anyway, then presumably one of the advantages of push/pop
> over callee saves goes away.

Even if you need to allocate a frame, push/pop has the advantage of a short
encoding.  But indeed it would be interesting to model the fact that
introducing a frame has some cost.
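
Just to illustrate the size difference (a rough sketch; byte counts are
the usual encodings, %rbx picked arbitrarily), keeping a value in a
call-saved register costs

  push   %rbx                  # 53                1 byte
  ...
  pop    %rbx                  # 5b                1 byte

while spilling it to a frame slot costs

  mov    %rbx,0x8(%rsp)        # 48 89 5c 24 08    5 bytes
  ...
  mov    0x8(%rsp),%rbx        # 48 8b 5c 24 08    5 bytes

plus, when no frame exists otherwise, the sub/add of %rsp (4 bytes each)
seen in the dump above.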

On modern x86 the cost of stack adjustments is relatively low, since the
CPUs have a stack engine.
> 
> But going back to my question above, modelling the allocation and
> deallocation would need to be done in a way that makes them more
> expensive than moves.  I think in some cases we've assumed that
> addition and moves have equal cost, even when optimising for speed.

Execution costs of stack adjustment are probably not much bigger than
those of a reg-reg move, so a cost of 2 or 3 would make sense to me.
> 
> In other words, rather than add a hook to tweak the -1 bias (which I
> still agree is a more reasonable thing to do than bypassing the IRA
> code altogether), could we add a separate hook for the allocation/
> deallocation and leave IRA to work out when to apply it?
> 
> I suppose for -fno-omit-frame-pointer we should also take the creation
> of the frame link into account, if we don't already.  This matters on
> aarch64 since -fomit-frame-pointer is not the default at -O2.
> 
> One quirk on aarch64 is that, if we're using an odd number of
> callee-saved GPRs, adding one more is essentially free, since we would
> allocate space for it anyway, and could save and restore it alongside
> the odd one out.  That would probably be difficult to apply in practice
> though.  And it's obviously a separate issue from the current one.
> Just mentioning it as another thing we could model in future.
Looks like fun, indeed.
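
To illustrate with a rough sketch (assuming the usual 16-byte stack
alignment; register numbers picked arbitrarily): with an odd number of
callee-saved GPRs the prologue looks like

  stp     x29, x30, [sp, #-32]!   // frame link
  str     x19, [sp, #16]          // 8 bytes of padding left at sp+24

while adding one more register just turns the str/ldr into stp/ldp over
the same 32 bytes:

  stp     x29, x30, [sp, #-32]!   // frame link
  stp     x19, x20, [sp, #16]     // second register uses the padding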

Honza
> 
> Thanks,
> Richard
