Using gcc 4.4.0, I compiled a function and found the generated code a tad
long. The command line is:
gcc -Os -mcpu=arm7tdmi-s -S func.c
although the output is pretty much the same with -O2 or -O3 as well (only
a few instructions longer).
The function is basically an unrolled 32-bit unsigned division by 1E9:
unsigned int divby1e9( unsigned int num, unsigned int *quotient )
{
    unsigned int dig;
    unsigned int tmp;

    tmp = 1000000000u;
    dig = 0;
    if ( num >= tmp ) {
        tmp <<= 2;
        if ( num >= tmp ) {
            num -= tmp;
            dig = 4;
        }
        else {
            tmp >>= 1;
            if ( num >= tmp ) {
                num -= tmp;
                dig = 2;
            }
            tmp >>= 1;
            if ( num >= tmp ) {
                num -= tmp;
                dig++;
            }
        }
    }
    *quotient = dig;
    return num;
}
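As a sanity check of the C source (not of the generated assembly), a small
harness along these lines can compare the function against the built-in /
and % operators. The helper names check_one, check_boundaries and self_test
are made up for this sketch, and it assumes that unsigned int is 32 bits wide:

```c
#include <assert.h>

/* Same function as in the post, repeated so that this file compiles alone. */
unsigned int divby1e9(unsigned int num, unsigned int *quotient)
{
    unsigned int dig = 0;
    unsigned int tmp = 1000000000u;

    if (num >= tmp) {
        tmp <<= 2;                      /* 4e9 still fits in 32 bits */
        if (num >= tmp) {
            num -= tmp;
            dig = 4;
        } else {
            tmp >>= 1;                  /* 2e9 */
            if (num >= tmp) {
                num -= tmp;
                dig = 2;
            }
            tmp >>= 1;                  /* 1e9 */
            if (num >= tmp) {
                num -= tmp;
                dig++;
            }
        }
    }
    *quotient = dig;
    return num;
}

/* Hypothetical helper: 1 if divby1e9 agrees with / and % for n, else 0. */
int check_one(unsigned int n)
{
    unsigned int q;
    unsigned int r = divby1e9(n, &q);

    return q == n / 1000000000u && r == n % 1000000000u;
}

/* Hypothetical helper: number of mismatches at the interval boundaries. */
int check_boundaries(void)
{
    static const unsigned int cases[] = {
        0u, 999999999u, 1000000000u, 1999999999u, 2000000000u,
        2999999999u, 3000000000u, 3999999999u, 4000000000u, 4294967295u
    };
    int bad = 0;
    unsigned int i;

    for (i = 0; i < sizeof cases / sizeof cases[0]; i++)
        if (!check_one(cases[i]))
            bad++;
    return bad;
}

/* Hypothetical entry point for the check; call it from a test main(). */
void self_test(void)
{
    assert(check_boundaries() == 0);
}
```

The boundary values are the points where the quotient digit changes, which
is where an unrolled division like this is most likely to go wrong.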
The compiler generated the following code:
divby1e9:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r3, .L10
        cmp     r0, r3
        movls   r3, #0
        bls     .L3
        ldr     r2, .L10+4
        cmp     r0, r2
        addhi   r0, r0, #293601280
        addhi   r0, r0, #1359872
        addhi   r0, r0, #6144
        movhi   r3, #4
        bhi     .L3
.L4:
        ldr     r2, .L10+8
        cmp     r0, r2
        movls   r3, #0
        bls     .L6
        add     r0, r0, #-2013265920
        add     r0, r0, #13238272
        add     r0, r0, #27648
        cmp     r0, r3
        movls   r3, #2
        bls     .L3
        mov     r3, #2
.L6:
        add     r0, r0, #-1006632960
        add     r0, r0, #6619136
        add     r0, r0, #13824
        add     r3, r3, #1
.L3:
        str     r3, [r1, #0]
        bx      lr
.L11:
        .align  2
.L10:
        .word   999999999
        .word   -294967297
        .word   1999999999
Note that it is sub-optimal on two counts.
First, each constant is built up with 3 instructions, costing 3 words and
3 clocks. Keeping the constant in the literal pool and fetching it with an
ldr also takes 3 clocks, but only two 32-bit words, and identical constants
need to be stored only once. The timing argument holds only on the
ARM7TDMI-S, which has no caches, so that's just a minor issue, but the
memory saving holds no matter what ARM core you have (note that -Os was
specified).
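The arithmetic behind the three-instruction sequences can be checked with a
short sketch. Each group of three add immediates in the listing sums to
2^32 minus 4E9, 2E9 or 1E9, so adding them modulo 2^32 subtracts that value.
Each piece has to fit the ARM data-processing immediate format, which is an
8-bit value rotated right by an even number of bits. The function names
below are invented for this illustration:

```c
#include <assert.h>
#include <stdint.h>

/* 1 if v is encodable as an ARM data-processing immediate:
   an 8-bit value rotated right by an even number of bits. */
int is_arm_immediate(uint32_t v)
{
    int rot;

    for (rot = 0; rot < 32; rot += 2) {
        /* Rotating v left by rot undoes a rotate-right by rot. */
        uint32_t r = (rot == 0) ? v : ((v << rot) | (v >> (32 - rot)));
        if (r <= 0xFFu)
            return 1;
    }
    return 0;
}

/* Sum of three immediates, modulo 2^32. */
uint32_t sum3(uint32_t a, uint32_t b, uint32_t c)
{
    return a + b + c;
}

/* Two's complement negation modulo 2^32, i.e. 2^32 - x for x != 0. */
uint32_t neg32(uint32_t x)
{
    return (uint32_t)(0u - x);
}

/* Checks the constants that appear in the assembly listing. */
void check_constants(void)
{
    assert(sum3(293601280u, 1359872u, 6144u) == neg32(4000000000u));
    assert(sum3((uint32_t)(-2013265920), 13238272u, 27648u)
           == neg32(2000000000u));
    assert(sum3((uint32_t)(-1006632960), 6619136u, 13824u)
           == neg32(1000000000u));

    /* Every piece is a valid immediate, the full constants are not. */
    assert(is_arm_immediate(293601280u));
    assert(is_arm_immediate(1359872u));
    assert(is_arm_immediate(6144u));
    assert(!is_arm_immediate(neg32(4000000000u)));
    assert(!is_arm_immediate(1000000000u));
}
```

Since none of the full constants is a valid immediate on its own, the
compiler has to choose between splitting each one into several adds and
loading it from the literal pool.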
Second, and this is the real problem: had the compiler not tried to be
overly clever and simply compiled the code as it was written, then instead
of loading the constants 4 times, at the cost of 3 instructions each, it
could have loaded the constant only once and then derived each following
constant with a single-word, single-clock shift. The code would have been
both shorter *and* faster, plus some of the jumps could have been
eliminated. Practically each C statement line (except the braces)
corresponds to one assembly instruction, so without being clever, just
translating what's written, it could be done in 20 words instead of 30.
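One way to push the compiler towards translating the shifts literally is to
hide the value of tmp from constant propagation with an empty GCC extended
asm statement. This is only a sketch: the HIDE_VALUE macro and the name
divby1e9_shift are invented here, the construct is specific to GCC-compatible
compilers, and whether gcc 4.4.0 then emits the shorter sequence has not been
checked.

```c
#include <assert.h>

/* Hypothetical macro: an empty asm that claims to read and modify x in a
   register, so the compiler can no longer assume it knows the value of x. */
#define HIDE_VALUE(x) __asm__ ("" : "+r" (x))

unsigned int divby1e9_shift(unsigned int num, unsigned int *quotient)
{
    unsigned int dig = 0;
    unsigned int tmp = 1000000000u;

    HIDE_VALUE(tmp);        /* tmp is opaque to the optimizer from here on */

    if (num >= tmp) {
        tmp <<= 2;          /* must now be computed at run time */
        if (num >= tmp) {
            num -= tmp;
            dig = 4;
        } else {
            tmp >>= 1;
            if (num >= tmp) {
                num -= tmp;
                dig = 2;
            }
            tmp >>= 1;
            if (num >= tmp) {
                num -= tmp;
                dig++;
            }
        }
    }
    *quotient = dig;
    return num;
}

/* Hypothetical spot check that the result is unchanged by the barrier. */
void check_shift_variant(void)
{
    unsigned int q;

    assert(divby1e9_shift(3999999999u, &q) == 999999999u && q == 3u);
    assert(divby1e9_shift(999999999u, &q) == 999999999u && q == 0u);
}
```

The barrier itself emits no instructions. The trade-off is that the compiler
also loses the knowledge that tmp is a constant everywhere else in the
function, so the result should be inspected with -S before relying on it.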
Is this a problem worth reporting in bugzilla, or do I just have to use
some trickery to save the compiler from being smarter than it is?
Zoltan