On Mon, 16 Feb 2009 10:17:36 -0500
Daniel Jacobowitz <d...@false.org> wrote:

> On Mon, Feb 16, 2009 at 12:19:52PM +0100, Vincent R. wrote:
> > 00011000 <WinMainCRTStartup>:
> > [...]
> 
> Notice how many more registers used to be pushed?  I expect the new
> code is faster.

Assuming an ARM7 core with 0 wait-state memory and removing all the
identical call bits from the functions, the clocks are on the right
hand side:

   11000:       e92d40f0        push    {r4, r5, r6, r7, lr}  7
   11004:       e1a04000        mov     r4, r0                1
   11008:       e1a05001        mov     r5, r1                1
   1100c:       e1a06002        mov     r6, r2                1
   11010:       e1a07003        mov     r7, r3                1
   11024:       e1a01005        mov     r1, r5                1
   11028:       e1a00004        mov     r0, r4                1
   1102c:       e1a02006        mov     r2, r6                1
   11030:       e1a03007        mov     r3, r7                1
   11038:       e1a04000        mov     r4, r0                1
   11040:       e1a01004        mov     r1, r4                1
   11044:       e3a00042        mov     r0, #66               1

Total: 12 insns, 18 clocks

   11000:       e92d4010        push    {r4, lr}              4
   11004:       e1a04000        mov     r4, r0                1
   11008:       e24dd00c        sub     sp, sp, #12           1
   1100c:       e58d1008        str     r1, [sp, #8]          2
   11010:       e58d2004        str     r2, [sp, #4]          2
   11014:       e58d3000        str     r3, [sp]              2
   11028:       e59d1008        ldr     r1, [sp, #8]          3
   1102c:       e1a00004        mov     r0, r4                1
   11030:       e59d2004        ldr     r2, [sp, #4]          3
   11034:       e59d3000        ldr     r3, [sp]              3
   1103c:       e1a04000        mov     r4, r0                1
   11044:       e1a01004        mov     r1, r4                1
   11048:       e3a00042        mov     r0, #66               1

Total: 13 insns, 25 clocks.
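
The two totals can be rechecked with a small script. This is only a
sketch: the per-instruction costs are the ones quoted in the listings
above (push of n registers 2+n, mov and sub 1, str 2, ldr 3), and the
names COST, clocks, old and new are made up for illustration.

```python
# Per-instruction clock costs as quoted in the listings above
# (ARM7 core, 0 wait-state memory). All names here are illustrative.
COST = {"mov": 1, "sub": 1, "str": 2, "ldr": 3}

def clocks(insns):
    """Sum the clocks for a list of (mnemonic, register_count) pairs."""
    total = 0
    for op, nregs in insns:
        # A push of n registers costs 2+n; everything else is a flat cost.
        total += 2 + nregs if op == "push" else COST[op]
    return total

# 4.1.x: push {r4-r7, lr} followed by 11 reg-reg moves
old = [("push", 5)] + [("mov", 0)] * 11
# 4.4.x: push {r4, lr}, then the sequence from the second listing
new = [("push", 2), ("mov", 0), ("sub", 0)] + [("str", 0)] * 3 + [
    ("ldr", 0), ("mov", 0), ("ldr", 0), ("ldr", 0)] + [("mov", 0)] * 3

old_clocks, new_clocks = clocks(old), clocks(new)
slowdown = (new_clocks - old_clocks) / old_clocks
print(len(old), old_clocks)   # 12 18
print(len(new), new_clocks)   # 13 25
print(round(slowdown, 4))     # 0.3889
```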

So the code generated by the 4.4.x compiler is almost 40% slower
((25-18)/18 = 0.3889) than the 4.1.x version, and it is also longer.
Pushing many registers is cheap because it takes 2+n clocks
to move n registers to memory, and then it is n extra clocks to copy
your n registers to the call-saved ones that you pushed. Total cost
2+2n. Storing them individually costs you 1 clock to make space on the
stack and 2n clocks to store them (each str is 2 clocks in the listing
above), i.e. 1+2n, so the up-front cost is about the same. The
difference shows up when they have to become parameters to the function
calls: a reg-reg move costs you 1 clock while a load from memory is 3.
The example
function does not actually return, but if it did, the old compiler
would lose some of its advantage. The old compiler would finish the
function with

  pop {r4,r5,r6,r7,pc} (9 clocks, final: 13 insns 27 clocks)

and the new compiler's version would be

  add sp,sp,#12 (1 clock)
  pop {r4,pc}   (6 clocks, final: 15 insns 32 clocks)

Even then the old compiler would still beat the new one both in size
and speed.
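
For the returning case the same arithmetic extends by the epilogue. A
rough sketch, assuming a pop that loads pc costs 4+n clocks for n
registers (inferred from the 9 and 6 clock figures above, not taken
from a datasheet); the function and variable names are hypothetical.

```python
def pop_pc_clocks(nregs):
    # Assumed cost of pop {..., pc}: 4 + n clocks, matching the
    # 9 (n=5) and 6 (n=2) figures quoted above.
    return 4 + nregs

# Body totals from the two listings: instructions, clocks
old_insns, old_clocks = 12, 18
new_insns, new_clocks = 13, 25

# 4.1.x epilogue: pop {r4,r5,r6,r7,pc}
old_final = (old_insns + 1, old_clocks + pop_pc_clocks(5))
# 4.4.x epilogue: add sp,sp,#12 (1 clock), then pop {r4,pc}
new_final = (new_insns + 2, new_clocks + 1 + pop_pc_clocks(2))

print(old_final)  # (13, 27)
print(new_final)  # (15, 32)
```

With the return included the gap narrows from about 39% to about 19%
((32-27)/27), which is the lost advantage mentioned above.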

Zoltan
