As reported at the thread in http://gcc.gnu.org/ml/gcc/2009-03/msg00369.html
Using gcc 4.4.0, I compiled a function and found the result a tad long. The command line is:

    gcc -Os -mcpu=arm7tdmi-s -S func.c

although the output is pretty much the same with -O2 or -O3 as well (only a few instructions longer).

The function is basically an unrolled 32-bit unsigned division by 1E9:

    unsigned int divby1e9( unsigned int num, unsigned int *quotient )
    {
        unsigned int dig;
        unsigned int tmp;

        tmp = 1000000000u;
        dig = 0;

        if ( num >= tmp ) {
            tmp <<= 2;
            if ( num >= tmp ) {
                num -= tmp;
                dig = 4;
            }
            else {
                tmp >>= 1;
                if ( num >= tmp ) {
                    num -= tmp;
                    dig = 2;
                }
                tmp >>= 1;
                if ( num >= tmp ) {
                    num -= tmp;
                    dig++;
                }
            }
        }
        *quotient = dig;
        return num;
    }

The compiler generated the following code:

    divby1e9:
            @ Function supports interworking.
            @ args = 0, pretend = 0, frame = 0
            @ frame_needed = 0, uses_anonymous_args = 0
            @ link register save eliminated.
            ldr     r3, .L10
            cmp     r0, r3
            movls   r3, #0
            bls     .L3
            ldr     r2, .L10+4
            cmp     r0, r2
            addhi   r0, r0, #293601280
            addhi   r0, r0, #1359872
            addhi   r0, r0, #6144
            movhi   r3, #4
            bhi     .L3
    .L4:
            ldr     r2, .L10+8
            cmp     r0, r2
            movls   r3, #0
            bls     .L6
            add     r0, r0, #-2013265920
            add     r0, r0, #13238272
            add     r0, r0, #27648
            cmp     r0, r3
            movls   r3, #2
            bls     .L3
            mov     r3, #2
    .L6:
            add     r0, r0, #-1006632960
            add     r0, r0, #6619136
            add     r0, r0, #13824
            add     r3, r3, #1
    .L3:
            str     r3, [r1, #0]
            bx      lr
    .L11:
            .align  2
    .L10:
            .word   999999999
            .word   -294967297
            .word   1999999999

Note that it is sub-optimal on two counts.

First, each loading of a constant takes 3 instructions and 3 clocks. Storing the constant and fetching it with an ldr also takes 3 clocks but only two 32-bit words, and identical constants need to be stored only once. The speed argument holds only on the ARM7TDMI-S, which has no caches, so that is just a minor issue, but the memory saving holds no matter what ARM core you have (note that -Os was specified).
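The three-instruction sequences in the listing can be checked with a little arithmetic: each group of three add immediates should sum, modulo 2^32, to the negation of the constant that the C source subtracts. The sketch below does that check in C. The helper names sum3 and verify_constants are made up for this illustration and appear nowhere in gcc or in the original post; the immediates are copied from the assembly listing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: add three immediates with 32-bit wraparound,
   the way three consecutive "add r0, r0, #imm" instructions would
   accumulate them into a register. */
static uint32_t sum3(int64_t a, int64_t b, int64_t c)
{
    return (uint32_t)((uint32_t)a + (uint32_t)b + (uint32_t)c);
}

/* Hypothetical helper: check that each triple from the listing equals
   "minus the constant" modulo 2^32, i.e. it implements num -= tmp. */
static void verify_constants(void)
{
    /* addhi triple: num -= 4000000000 */
    assert(sum3(293601280, 1359872, 6144)
           == (uint32_t)(0u - 4000000000u));

    /* first add triple: num -= 2000000000 */
    assert(sum3(-2013265920, 13238272, 27648)
           == (uint32_t)(0u - 2000000000u));

    /* second add triple: num -= 1000000000 */
    assert(sum3(-1006632960, 6619136, 13824)
           == (uint32_t)(0u - 1000000000u));
}
```

Calling verify_constants() from a main() should complete silently if the sums are as expected. Each subtraction therefore costs three words of code, whereas the values involved are just 1000000000 shifted left by 2, 1 and 0 bits.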
Second, and this is the real problem: had the compiler not tried to be overly clever and simply compiled the code as it was written, then instead of loading the constants 4 times, at the cost of 3 instructions each, it could have loaded the constant only once and then generated each following constant at the cost of a single-word, single-clock shift. The code would have been rather shorter *and* faster, plus some of the jumps could have been eliminated. Practically each C statement line (except the braces) corresponds to one assembly instruction, so without being clever, just translating what's written, it could be done in 20 words instead of 30.

-- 
           Summary: Constant propagation in a number of tree passes does not
                    take into account machine costs.
           Product: gcc
           Version: lto
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: ramana dot r at gmail dot com
 GCC build triplet: i686-unknown-linux-gnu
  GCC host triplet: i686-unknown-linux-gnu
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39468