If speed is more important than size (likely, as most AVR chips today
have much more flash than those available long ago when that code was
originally written), here is a proposed (untested) patch to unroll the
loop (simply repeat the code 8 times) for >8K flash devices.
The larger and faster version doesn't use r23 as a loop counter; telling
GCC that r23 is not clobbered (which may allow better code around the
calls to this function) is left as an exercise for the reader. So it
could actually be a win in some cases on the smaller chips too.
Your call.

Thanks,
Marek

--- libgcc_config_avr_lib1funcs.S.orig 2024-04-21 22:22:35.231870200 +0200
+++ libgcc_config_avr_lib1funcs.S 2024-04-21 23:11:54.285118400 +0200
@@ -1340,8 +1340,17 @@
 #if defined (L_udivmodqi4)
 DEFUN __udivmodqi4
     clr  r_rem           ; clear remainder
-    ldi  r_cnt,8         ; init loop counter
     lsl  r_arg1          ; shift dividend
+#ifdef __AVR_HAVE_JMP_CALL__ /* Optimize speed: 40 words, 40 cycles, r_cnt not used. */
+.rept 8
+    rol  r_rem           ; shift dividend into remainder
+    cp   r_rem,r_arg2    ; compare remainder & divisor
+    brcs 1f              ; remainder <= divisor
+    sub  r_rem,r_arg2    ; restore remainder
+1:  rol  r_arg1          ; shift dividend (with CARRY)
+.endr
+#else /* Optimize size: 8 words, 64 cycles. */
+    ldi  r_cnt,8         ; init loop counter
 __udivmodqi4_loop:
     rol  r_rem           ; shift dividend into remainder
     cp   r_rem,r_arg2    ; compare remainder & divisor
@@ -1351,6 +1360,7 @@
     rol  r_arg1          ; shift dividend (with CARRY)
     dec  r_cnt           ; decrement loop counter
     brne __udivmodqi4_loop
+#endif
     com  r_arg1          ; complement result
                          ; because C flag was complemented in loop
     ret

On Sun, Apr 21, 2024 at 03:22:31PM +0200, Georg-Johann Lay wrote:
> On 21.04.24 at 10:08, Wolfgang Hospital wrote:
> > Dear all,
> >
> > Is there a test scaffold for the functions from lib1funcs.S,
> > correctness, size&speed over the variety of 8-bit AVR cores?
>
> Size is the easiest one: Just determine the size of, say
> -nodefaultlibs -nostartfiles against a respective compilation
> with -Wl,-u,__divmodqi4
>
> Benchmarking speed is not so easy. I am using the avrtest core
> simulator because it is fast, simulating a core is enough, and
> it has some extra features, e.g. get random values and get values
> out of the target, e.g.
> LOG_FMT_DOUBLE ("double = %f\n", x);
>
> https://github.com/sprintersb/atest
>
> See the end of this mail for an example.
>
> For correctness, most of the functions are tested off testsuite
> by hand-written programs that test new implementations against
> existing ones, like in the code below. Such tests don't make sense
> any more when the new version is integrated. And performance
> tests / comparisons are misplaced in the GCC testsuite anyway.
>
> > Is there a more comprehensive statement of calling conventions than
> > https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention,
>
> It is comprehensive, but likely not complete. For completeness, you'll
> have to resort to avr.md and the files it includes. There is no
> table that lists the non-ABI stuff though; you'll have to find the
> transparent calls, usually of type "xcall". Notice however that
> such functions may be ABI or non-ABI. Transparent calls are basically
> used for two purposes:
> * Non-ABI calls like some mul stuff that gets param in X reg.
> * ABI calls that don't clobber all callee-used regs, in order to
>   model the smaller footprint.
>
> > in particular explicitly stating which functions are guaranteed to have
> > __zero_reg__ 0 on entry/where it suffices to have __zero_reg__ 0 on
> > return as opposed to preserving its value?
>
> When a function does /not/ have zero_reg=0 on entry, then the compiler
> or libc (or application code) has a bug. Same when zero_reg!=0 on
> exit.
>
> > I've been tinkering around, the "ldi r_cnt, 9" "rjmp entry point" in
> > __udivmodqi4 instead of "ldi r_cnt, 8" "lsl r_arg1" annoying me for
> > years. (Biggest relative strict improvement I found, FWIW.)
>
> I went ahead and applied it, see https://gcc.gnu.org/PR114794
>
> In order to test it, I ran the following code with
> avrtest_log -q -no-log ...
>
> <CODE>
> #include <stdint.h>
> #include "avrtest.h"
>
> volatile uint8_t q8, my_q8;
> volatile uint8_t r8, my_r8;
>
> extern void __udivmodqi4 (void);
> extern void my_udivmodqi4 (void);
>
> __asm("\n"
>       "r_rem = 25 /* remainder */" "\n"
>       "r_arg1 = 24 /* dividend, quotient */" "\n"
>       "r_arg2 = 22 /* divisor */" "\n"
>       "r_cnt = 23 /* loop count */" "\n"
>       ".pushsection .text" "\n"
>       ".global my_udivmodqi4" "\n"
>       "my_udivmodqi4:" "\n\t"
>       " sub r_rem,r_rem ; clear remainder and carry" "\n\t"
>       " ldi r_cnt,8 ; init loop counter" "\n\t"
>       " lsl r_arg1 ; shift dividend" "\n\t"
>       "__udivmodqi4_loop:" "\n\t"
>       " rol r_rem ; shift dividend into remainder" "\n\t"
>       " cp r_rem,r_arg2 ; compare remainder & divisor" "\n\t"
>       " brcs __udivmodqi4_ep ; remainder <= divisor" "\n\t"
>       " sub r_rem,r_arg2 ; restore remainder" "\n\t"
>       "__udivmodqi4_ep:" "\n\t"
>       " rol r_arg1 ; shift dividend (with CARRY)" "\n\t"
>       " dec r_cnt ; decrement loop counter" "\n\t"
>       " brne __udivmodqi4_loop" "\n\t"
>       " com r_arg1 ; complement result" "\n\t"
>       " ; because C flag was complemented in loop"
>       "\n\t"
>       " ret" "\n\t"
>       ".popsection");
>
> static inline __attribute__((__always_inline__))
> void my_divmod8 (volatile uint8_t *pq, volatile uint8_t *prem,
>                  uint8_t dividend, uint8_t divisor)
> {
>     register uint8_t rem asm("25");
>     register uint8_t q asm("24");
>     register uint8_t r22 asm("22") = divisor;
>     register uint8_t r24 asm("24") = dividend;
>     asm ("%~call %x[func]"
>          : "=r" (q), "=r" (rem)
>          : "r" (r22), "r" (r24), [func] "i" (my_udivmodqi4)
>          : "r23");
>     *pq = q;
>     *prem = rem;
> }
>
> static inline __attribute__((__always_inline__))
> void divmod8 (volatile uint8_t *pq, volatile uint8_t *prem,
>               uint8_t dividend, uint8_t divisor)
> {
>     register uint8_t rem asm("25");
>     register uint8_t q asm("24");
>     register uint8_t r22 asm("22") = divisor;
>     register uint8_t r24 asm("24") = dividend;
>     asm ("%~call %x[func]"
>          : "=r" (q), "=r" (rem)
>          : "r" (r22), "r" (r24), [func] "i" (__udivmodqi4)
>          : "r23");
>     *pq = q;
>     *prem = rem;
> }
>
> void bench_divmod8 (void)
> {
>     uint8_t a = 0;
>     do
>     {
>         uint8_t b = 1;
>         do
>         {
>             PERF_START_CALL (1);
>             divmod8 (&q8, &r8, a, b);
>             PERF_STOP (1);
>
>             PERF_START_CALL (2);
>             my_divmod8 (&my_q8, &my_r8, a, b);
>             PERF_STOP (2);
>
>             if (q8 != my_q8 || r8 != my_r8)
>                 __builtin_abort();
>         } while (++b);
>     } while (++a);
> }
>
> int main (void)
> {
>     bench_divmod8();
>     PERF_DUMP_ALL;
>     return 0;
> }
> </CODE>
>
> The input space is only 16 bits wide, so a full coverage is possible.
> With larger input spaces, one could use avrtest_[p]rand() or
> similar means to randomize the input.
>
> The output is as follows:
>
> $ avrtest_log -mmcu=avr5 -no-log ben.elf -m 100000000 -q
>
> --- Dump # 1:
>  Timer T1 "" (65280 rounds):  00ec--00fc
>               Instructions        Ticks
>     Total:        3765820       5222400
>     Mean:              57            80
>     Stand.Dev:        0.9           0.0
>     Min:               57            80
>     Max:               65            80
>     Calls (abs) in [   2,   3] was:   2 now:   2
>     Calls (rel) in [   0,   1] was:   0 now:   0
>     Stack (abs) in [08fb,08f9] was:08fb now:08fb
>     Stack (rel) in [   0,   2] was:   0 now:   0
>
>            Min round Max round    Min tag  /  Max tag
>     Calls  -all-same-                      /
>     Stack  -all-same-                      /
>     Instr.         1     65026   -no-tag-  /  -no-tag-
>     Ticks  -all-same-                      /
>
>  Timer T2 "" (65280 rounds):  0108--0116
>               Instructions        Ticks
>     Total:        3569980       4896000
>     Mean:              54            75
>     Stand.Dev:        0.9           0.0
>     Min:               54            75
>     Max:               62            75
>     Calls (abs) in [   2,   3] was:   2 now:   2
>     Calls (rel) in [   0,   1] was:   0 now:   0
>     Stack (abs) in [08fb,08f9] was:08fb now:08fb
>     Stack (rel) in [   0,   2] was:   0 now:   0
>
>            Min round Max round    Min tag  /  Max tag
>     Calls  -all-same-                      /
>     Stack  -all-same-                      /
>     Instr.         1     65026   -no-tag-  /  -no-tag-
>     Ticks  -all-same-                      /
>
> So the new code requires 5 ticks less (changed from 80 to 75)
>
> "Calls" is the (relative or absolute) call depth.
> "Stack" is the (relative or absolute) stack usage.
>
> Johann
>
> > Recommendations for a platform to vent such ideas welcome (I know of
> > stackoverflow.com).
> >
> > regards
> >
> > W. Hospital
> >
> > --
> > Wolfgang Hospital
>