On Mon, 16 Feb 2009 10:17:36 -0500 Daniel Jacobowitz <d...@false.org> wrote:
> On Mon, Feb 16, 2009 at 12:19:52PM +0100, Vincent R. wrote:
> > 00011000 <WinMainCRTStartup>:
> > [...]
>
> Notice how many more registers used to be pushed?  I expect the new
> code is faster.

Assuming an ARM7 core with 0 wait-state memory and removing all the
identical call bits from the functions, the clocks are on the right
hand side:

   11000:  e92d40f0   push  {r4, r5, r6, r7, lr}   7
   11004:  e1a04000   mov   r4, r0                 1
   11008:  e1a05001   mov   r5, r1                 1
   1100c:  e1a06002   mov   r6, r2                 1
   11010:  e1a07003   mov   r7, r3                 1
   11024:  e1a01005   mov   r1, r5                 1
   11028:  e1a00004   mov   r0, r4                 1
   1102c:  e1a02006   mov   r2, r6                 1
   11030:  e1a03007   mov   r3, r7                 1
   11038:  e1a04000   mov   r4, r0                 1
   11040:  e1a01004   mov   r1, r4                 1
   11044:  e3a00042   mov   r0, #66                1

   Total: 12 insns, 18 clocks

   11000:  e92d4010   push  {r4, lr}               4
   11004:  e1a04000   mov   r4, r0                 1
   11008:  e24dd00c   sub   sp, sp, #12            1
   1100c:  e58d1008   str   r1, [sp, #8]           2
   11010:  e58d2004   str   r2, [sp, #4]           2
   11014:  e58d3000   str   r3, [sp]               2
   11028:  e59d1008   ldr   r1, [sp, #8]           3
   1102c:  e1a00004   mov   r0, r4                 1
   11030:  e59d2004   ldr   r2, [sp, #4]           3
   11034:  e59d3000   ldr   r3, [sp]               3
   1103c:  e1a04000   mov   r4, r0                 1
   11044:  e1a01004   mov   r1, r4                 1
   11048:  e3a00042   mov   r0, #66                1

   Total: 13 insns, 25 clocks

So the version generated by the 4.4.x compiler is almost 40% slower
((25-18)/18 = 0.3889) than the 4.1.x version, and it is also longer.

Pushing many registers is cheap because it takes 2+n clocks to move n
registers to memory, and then it is n extra clocks to copy your n
registers to the call-saved ones that you pushed.  Total cost 2+2n.
Storing them individually costs you 1 clock to make space on the stack,
3n clocks to store them on the stack, i.e. 1+3n.  In addition, when you
get them to become parameters to the function calls, a reg-reg move
costs you 1 clock while a load from memory is 3.

The example function does not actually return, but if it did, the old
compiler would lose some of its advantage.
The old compiler would finish the function with

   pop   {r4,r5,r6,r7,pc}   (9 clocks, final: 13 insns 27 clocks)

and the new compiler's version would be

   add   sp,sp,#12          (1 clock)
   pop   {r4,pc}            (6 clocks, final: 15 insns 32 clocks)

Even then the old compiler would still beat the new one both in size
and speed.

Zoltan
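
For anyone who wants to re-check the tallies above, here is a minimal Python sketch that adds up the per-instruction clocks. The per-instruction costs are the ones quoted in this message; the helper names (`push`, `pop_pc`, `summarize`) are made up for illustration, and the `4 + n` cost for a pop that includes pc is simply fitted to the two figures given above (9 clocks for 5 registers, 6 clocks for 2), not taken from a datasheet.

```python
# Clock costs as quoted in the message (ARM7, 0 wait-state memory).
MOV = 1   # reg-reg move or mov-immediate
ALU = 1   # add/sub on sp
STR = 2   # single store to the stack
LDR = 3   # single load from the stack


def push(n):
    """Store-multiple of n registers: 2 + n clocks."""
    return 2 + n


def pop_pc(n):
    """Load-multiple of n registers including pc.

    Assumption: 4 + n, fitted to the 9- and 6-clock figures in the message.
    """
    return 4 + n


def summarize(clocks):
    """Return (instruction count, total clocks) for a list of per-insn costs."""
    return len(clocks), sum(clocks)


# First listing: push 5 registers, then 11 single-clock moves.
old_body = [push(5)] + [MOV] * 11

# Second listing: push 2 registers, spill r1-r3 to the stack, reload them later.
new_body = [push(2), MOV, ALU] + [STR] * 3 + [LDR, MOV, LDR, LDR] + [MOV] * 3

# Hypothetical epilogues, for the case where the function did return.
old_full = old_body + [pop_pc(5)]
new_full = new_body + [ALU, pop_pc(2)]

# Relative slowdown of the second listing versus the first (no epilogue).
slowdown = (sum(new_body) - sum(old_body)) / sum(old_body)

if __name__ == "__main__":
    for name, seq in [("old", old_body), ("new", new_body),
                      ("old+return", old_full), ("new+return", new_full)]:
        insns, clocks = summarize(seq)
        print(f"{name}: {insns} insns, {clocks} clocks")
    print(f"slowdown without epilogue: {slowdown:.4f}")
```

Changing one of the constants at the top (for instance to model memory with wait states) shows how sensitive the comparison is to the cost of ldr and str.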