Eric Weddington wrote:
__umulhisi3 reads:
DEFUN __umulhisi3
    mul     A0, B0
    movw    C0, r0
    mul     A1, B1
    movw    C2, r0
    mul     A0, B1
    add     C1, r0
    adc     C2, r1
    clr     __zero_reg__
    adc     C3, __zero_reg__
    mul     A1, B0
    add     C1, r0
    adc     C2, r1
    clr     __zero_reg__
    adc     C3, __zero_reg__
    ret
ENDF __umulhisi3
It could be compressed to the following sequence, i.e.
24 bytes instead of 30, but I think that goes too far in
squeezing the last byte out of the code:
DEFUN __umulhisi3
    mul     A0, B0
    movw    C0, r0
    mul     A1, B1
    movw    C2, r0
    mul     A0, B1
    rcall   1f
    mul     A1, B0
1:  add     C1, r0
    adc     C2, r1
    clr     __zero_reg__
    adc     C3, __zero_reg__
    ret
ENDF __umulhisi3
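
For readers who want to check the arithmetic, here is a rough C model of
what both sequences compute. It is only a sketch: the names acc32,
add_middle and umulhisi3_model are made up for illustration and are not
part of libgcc, and the carry flag is modelled with ordinary integer
arithmetic. The helper add_middle stands in for the shared add/adc/clr/adc
tail that the compressed version reaches once via rcall and once by
falling through.

```c
#include <stdint.h>

/* Hypothetical model of the result registers C0..C3 (C0 = low byte). */
typedef struct { uint8_t c[4]; } acc32;

/* Models the shared tail:
 *   add C1, r0 / adc C2, r1 / clr __zero_reg__ / adc C3, __zero_reg__
 * where prod is the 16-bit MUL result held in r1:r0. */
static void add_middle(acc32 *acc, uint16_t prod)
{
    unsigned sum = acc->c[1] + (prod & 0xFFu);      /* add C1, r0 */
    acc->c[1] = (uint8_t)sum;
    sum = acc->c[2] + (prod >> 8) + (sum >> 8);     /* adc C2, r1 */
    acc->c[2] = (uint8_t)sum;
    acc->c[3] = (uint8_t)(acc->c[3] + (sum >> 8));  /* adc C3, 0  */
}

/* 16 x 16 -> 32 bit unsigned multiply from four 8 x 8 partial products. */
static uint32_t umulhisi3_model(uint16_t a, uint16_t b)
{
    uint8_t a0 = (uint8_t)(a & 0xFF), a1 = (uint8_t)(a >> 8);
    uint8_t b0 = (uint8_t)(b & 0xFF), b1 = (uint8_t)(b >> 8);
    acc32 acc;

    uint16_t lo = (uint16_t)(a0 * b0);   /* mul A0, B0 ; movw C0, r0 */
    uint16_t hi = (uint16_t)(a1 * b1);   /* mul A1, B1 ; movw C2, r0 */
    acc.c[0] = (uint8_t)(lo & 0xFF);
    acc.c[1] = (uint8_t)(lo >> 8);
    acc.c[2] = (uint8_t)(hi & 0xFF);
    acc.c[3] = (uint8_t)(hi >> 8);

    add_middle(&acc, (uint16_t)(a0 * b1));   /* mul A0, B1 ; tail */
    add_middle(&acc, (uint16_t)(a1 * b0));   /* mul A1, B0 ; tail */

    return (uint32_t)acc.c[0]
         | ((uint32_t)acc.c[1] << 8)
         | ((uint32_t)acc.c[2] << 16)
         | ((uint32_t)acc.c[3] << 24);
}
```

Comparing umulhisi3_model(a, b) against (uint32_t)a * b for a handful of
inputs is a quick way to convince yourself that the two middle products
only ever touch C1..C3, which is why a single tail can serve both.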
However, I also like your creative compression in the second sequence
above, and I think that it would be best to implement that sequence and
try to find others like that where possible.
Maybe you see a sequence that I have overlooked?
Remember that to AVR users, code size is *everything*. Even saving
6 bytes here or there has a positive effect.
Actually I had that self tail-call in my initial version.
But look at the code flow for a 32-bit multiply:
user -> mulsi3 -> muluhisi3 -> umulhisi3 -> umulhisi3.tail
That is 4 call levels! And mulsi3 pushes 2 bytes, so a mulsi3
costs at least 10 bytes of stack (4 return addresses of at least
2 bytes each, plus the 2 pushed bytes). (I preferred pushes/pops
over clobbering Z.) The three calls (one of the four is inevitable)
and the pushes/pops will already cost ~30 ticks!
I found that too painful, and on devices with >= 8k flash the
self tail-call will save only 4 bytes.
One way would be to make the self tail-call conditional on
!__AVR_HAVE_JMP_CALL__, i.e. just do it on small devices.
Also note that on tiny devices without a MUL instruction
nothing changes anyway. The respective mulsi3 is unchanged.
Johann