Here's an interesting thing I'm learning about the kind of optimizations that might be in the EU of the newer machines. This started out as a pretty simple 'measure the branch penalty' exercise.
Given this program:

    #include <stdio.h>

    int b(int i){ return i+1; }

    int main(void){
        int i;
        for(i = 0; i < 1000000000; i = b(i))
            ;
        if (i != 1000000000)
            printf("Fix me\n");
    }

The assembly code on newer micros looks like this:

        .file   "sa.c"
        .text
    .globl b
        .type   b, @function
    b:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        addl    $1, %eax
        popl    %ebp
        ret
        .size   b, .-b
        .section        .rodata
    .LC0:
        .string "Fix me"
        .text
    .globl main
        .type   main, @function
    main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
        subl    $20, %esp
        movl    $0, -8(%ebp)
        jmp     .L4
    .L5:
        movl    -8(%ebp), %eax
        movl    %eax, (%esp)
        call    b
        movl    %eax, -8(%ebp)
    .L4:
        cmpl    $999999999, -8(%ebp)
        jle     .L5
        cmpl    $1000000000, -8(%ebp)
        je      .L10
        movl    $.LC0, (%esp)
        call    puts
    .L10:
        addl    $20, %esp
        popl    %ecx
        popl    %ebp
        leal    -4(%ecx), %esp
        ret
        .size   main, .-main
        .ident  "GCC: (GNU) 4.1.2 20070925 (Red Hat 4.1.2-27)"
        .section        .note.GNU-stack,"",@progbits

OK, let's modify b: a little, as follows:

    b:
    #if JMP > 0
        jmp bb
    #if JMP > 1
        call b
    #if JMP > 2
        call b
    #if JMP > 3
        call b
    #endif
    #endif
    #endif
    #endif
    bb:

So, simple: if JMP is 0, no change; if JMP is 1, we do (in essence) br .+2; if JMP is 2, we do br .+7; etc. The call instructions are never executed. Each one is just 5 bytes of filler for the 2-byte jmp to skip over. Now I time the run 10 times (I can run longer, but it seems good enough to establish behavior). I should get some rough idea of the cost of the branch.

Attachment 1 shows this cpu running in 32-bit mode:

    processor   : 0
    vendor_id   : GenuineIntel
    cpu family  : 15
    model       : 3
    model name  : Intel(R) Pentium(R) 4 CPU 3.40GHz
    stepping    : 4
    cpu MHz     : 3415.468
    cache size  : 1024 KB

The short form: the penalty for the branch is zero. As I say, I can run it until I get better statistics, but the trend is clear. Note this was an unloaded machine.

Well, how about my laptop? The penalty for the branch is negative. Our guess: the very highly dynamic power management is playing tricks with our minds. But we don't know. What we do know is that the code with the branch is consistently 25% faster.

    vendor_id   : GenuineIntel
    cpu family  : 6
    model       : 15
    model name  : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
    stepping    : 6
    cpu MHz     : 1000.000
    cache size  : 4096 KB

How about a 64-bit cpu?

    cpu family  : 15
    model       : 3
    model name  : Intel(R) Xeon(TM) CPU 3.40GHz
    stepping    : 4
    cpu MHz     : 3400.283
    cache size  : 1024 KB

Third graph. Here you can see the penalty increased a bit as the size of the jmp increased.

Don't try this with 8a. 8a is too damn smart -- it just optimizes the branch out (unless there is a switch to avoid doing that). You need a dumb assembler.

ron
<<attachment: prism.gif>>
<<attachment: results.gif>>
<<attachment: runtimes.64bit.gif>>