Eric Weddington wrote:

__umulhisi3 reads:

DEFUN __umulhisi3
   mul     A0, B0
   movw    C0, r0
   mul     A1, B1
   movw    C2, r0
   mul     A0, B1
   add     C1, r0
   adc     C2, r1
   clr     __zero_reg__
   adc     C3, __zero_reg__
   mul     A1, B0
   add     C1, r0
   adc     C2, r1
   clr     __zero_reg__
   adc     C3, __zero_reg__
   ret
ENDF __umulhisi3

It could be compressed to the following sequence, i.e.
24 bytes instead of 30, but I think that goes too far in
squeezing the last byte out of the code:

DEFUN __umulhisi3
   mul     A0, B0
   movw    C0, r0
   mul     A1, B1
   movw    C2, r0
   mul     A0, B1
   rcall   1f
   mul     A1, B0
1:  add     C1, r0
   adc     C2, r1
   clr     __zero_reg__
   adc     C3, __zero_reg__
   ret
ENDF __umulhisi3

However, I also like your creative compression in the second sequence
above, and I think that it would be best to implement that sequence and
try to find others like that where possible.

Maybe you see a sequence that I have overlooked?

Remember that to AVR users, code size is *everything*. Even saving
6 bytes here or there has a positive effect.

Actually I had that self tail-call in my initial version.
But look at the code flow for a 32-bit multiply:

user -> mulsi3 -> muluhisi3 -> umulhisi3 -> umulhisi3.tail

That is 4 call levels!  And mulsi3 pushes 2 bytes, so a mulsi3
costs at least 10 bytes of stack: four return addresses of at least
2 bytes each plus the 2 pushed bytes.  (I preferred pushes/pops
over clobbering Z.)  The three extra calls (one of the four is
inevitable) and the pushes/pops already cost ~30 ticks!
I found that too painful, and on devices with >= 8k flash the
self tail-call will save just 4 bytes.

One way would be to make the self tail-call depend on
!__AVR_HAVE_JMP_CALL__, i.e. just do it on small devices.
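
A rough sketch of how that could look, not a tested patch.  It assumes
the file is run through the C preprocessor and that A0/A1, B0/B1 and
C0..C3 are the same register macros as in the sequences above:

DEFUN __umulhisi3
   mul     A0, B0
   movw    C0, r0
   mul     A1, B1
   movw    C2, r0
   mul     A0, B1
#ifdef __AVR_HAVE_JMP_CALL__
   ;; Assumption: enough flash, so accumulate inline and
   ;; avoid the extra call level (30 bytes in total).
   add     C1, r0
   adc     C2, r1
   clr     __zero_reg__
   adc     C3, __zero_reg__
#else
   ;; Small devices: reuse the tail at 1: for this partial
   ;; product as well (24 bytes in total).
   rcall   1f
#endif
   mul     A1, B0
1: add     C1, r0
   adc     C2, r1
   clr     __zero_reg__
   adc     C3, __zero_reg__
   ret
ENDF __umulhisi3

With this split, only the small devices pay the additional return
address on the stack and the RCALL/RET cycles.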

Also note that on tiny devices without a MUL instruction
nothing changes anyway.  The respective mulsi3 is unchanged.

Johann
