If anyone wants to flex their asm-hacking skills, there's some draft code here. If not, don't panic; this is primarily notes to myself that someone else might be interested in.
To finish long long support in printf, I need to print 64-bit numbers (preferably, arbitrary-precision numbers) in octal, and it seems to be taking as long to write as the decimal code. Hex is easy to do byte at a time, but octal is messy because 3 doesn't divide 8. It's not that it's difficult, per se. Or that I need to stress about efficiency (to a first approximation, nobody uses octal, so it matters even less than usual). The problem is that messy translates to large code (boo!), and it's hard to see how large an alternative will be without writing it in detail. The old code just masked and printed the low-order bits, did a full-width shift by 1 bit, repeated it 3x, and continued as long as the result wasn't zero. It did that with a 12-instruction helper routine to shift the buffer by cnt bits: .L_lsr_4: ldi cnt, 4 .L_lsr: lsr v_fifth ror v_hhi ror v_hlo ror v_hi ror v_lo dec cnt brne .L_lsr ; tst sbiw v_hlo, 0 ; only Z flag is needed cpc v_lo, rzero cpc v_hi, rzero ret This was shared between the hex/octal print code and the decimal print code, so the rest of the hex/octal print code was only 20 instructions (+4 to branch to it depending on "base" on function entry). The equivalent for a variable-length memory buffer is rather more awkward, 14 instructions which *aren't* shared with the decimal print code: .L_lsr_4: ldi cnt, 4 .L_lsr: movw ZL, r24 sub merge, merge ; Set to to zero and clear carry mov tlen, len 1: ld r0, -Z ror r0 st Z, r0 or merge, r0 dec tlen bne 1b dec cnt bne .L_lsr tst merge ret Worse, the inner loop is 9 cycles to shift 1 byte by 1 bit, and the outer loop is another 5 per bit. Shifting an 8-byte value is 9 * 8 + 5 = 77 cycles per bit. If I have a 64-bit input to print, then that's 77 * 64 = 4928 cycles just spent shifting! Given that I can print a *decimal* 64-bit number in 4103 cycles, that seems Just Wrong. And even assuming I could keep the rest to 20 instructions, that's a total of 34. (It's not obvious I can, because while the shift leaves the lsbyte in r0 ready for printing, the first digit doesn't have a shift, so it needs to be handled specially.) After a lot of messing about, I came up with the following O(n) scheme. It buffers two registers, keeping 8..15 bits of the input buffered at any given time. Shifting is bit at a time, with the buffer refilled every 8 bits, and a digit printed every 3. (Or 4. The code is not only shared with hex, but can handle bases 2, 4 or 32 as well.) A trick I use repeatedly is bit shifting for loop counters. Rather than initialize a counter to "3" and decrement it each iteration, I initialize a counter to "7" (a value I already have available for digit-extraction purposes) and shift it right each iteration. For the bit buffer, I have 8..15 bits of buffered data, and rather than have a separate counter, mark the end of the valid data with a 1 bit. So the data starts out like "1abcdefg hijklmno" and shifts down until it's "00000001 abcdefgh". At that point, we start the next shift, but the msbyte becomes zero (with the 1 bit stored in the carry). That's the signal to fetch another byte. This reduces the loop overhead in both time and code size. Here's the current draft. Suggestions solicited! Registers: len: Length of input (in bytes). Shared code with the decimal print has stripped off leading zero bytes. X: Points to output buffer Z: Points to input buffer flags: Input flags distinguishing decimal, octal, and upper/lower case hex msb, lsb: Current bit buffer. Most significant set bit of msb counts bits per byte. mask: A bitmask the size of the current base's digits (7 or 15) tmask: A loop counter initialized to "mask" which counts bits per digit digit: a temporary used to form the ASCII digit delta: The amount to add to digits past '9' to make them start at 'a' or 'A'. .L_octhex: ldi mask, 7 lsr flags ; bit 1 in carry, but 1 in register breq 1f ldi mask, 15 ldi delta, 'A'-'0'+10 brcc 1f ldi delta, 'a'-'0'+10 1: ldi msb, 1 ldi tmask, 1 ld lsb, Z+ rjmp 2f .L_bitloop: ; Main loop ror lsb 2: lsr tmask ; Entry point, tmask >>= 1 brne 4f ; if (tmask == 0) { ; Spit out a digit of (lsb & mask) mov digit, lsb and digit, mask cpi digit, 10 brcs 3f add digit, delta 3: add digit, '0' st X+, digit mov tmask, mask 4: ; } lsr msb brne .L_bitloop ; End of main loop ; Load another byte ; Note: we know that carry is SET dec len ; Load another byte breq .L_lastbyte ld msb, Z+ ror msb ; Shift carry=1 into msbit rjmp .L_bitloop .L_lastbyte: lastbyte: adc msb, tmask ; Pad until end of digit (and clear carry) inc len cpse lsb, __zero_reg__ rjmp 4b movw r24, r26 ; Copy X to return value ret This is 35 instructions in its current state. I can probably shave one off the end by sharing the epilogue with the decimal print code, and I'm thinking of making the "mask" value be what's passed in from the caller to select the base, which will simplify the prologue. The main loop has two conditional branches, either of which could be the loop end. I could save a cycle (but cost an instruction) in the main loop by placing both conditions out of line. The other tricky thing is the "lastbyte" section, invoked when we need another byte but len == 0. When we get to the end of the input, we need to pad with zeros until the significant digits have all been printed. This involves setting msb to some suitable all-zero, and keeping going until lsb == 0 as well. We stop when len == 0 and lsb == 0. len is decremented *after* using the byte, so when len is decremented to 0, the last byte has already been used and we need to load an msbyte of padding. There are three ways to check for this, and I *think* I've picked the one with minimum code size (and fastest among those), but let me analyze them in detail. In the first two cases, we can pad with a full byte of 8 zero bits, then the "lastbyte" code is simply "ror msb" (which sets msb to 0x80 and clears the carry in one instruction, since we know we started with msb = 0 and carry = 1) and "rjmp .L_bitloop". Oh! Those instructions already exist at the end of the "load another byte" branch, so maybe the savings is more than I thought. The three ways of checking and the additional costs are: 1) Check each iteration of the main bit loop. This is slowest, so only use it if it has no rivals for smallest. This can be placed after the "ror lsb", adding three more instructions: branch if not equal, test len == 0, and branch if equal to epilogue code. 2) Check with each digit printed. This is identical code to the above, but inside the "if (!tmask)" condition to reduce overhead. The problem is that we don't get the test of lsb for free any more, so I think it's four instructions. No! We can do it in three: cp lsb, __zero_reg__ $ cpc len,__zero_reg__ $ breq epilogue CPC ANDs its result with the previous zero flag. *If* the zero flag was set (the only case we care about) then the carry flag is clear, and the cpc will compute the test we want. If the carry is set, the cpc will do the wrong thing, but that doesn't matter because zero is already clear. 3) The way I did it, checking with each byte fetched. We have to test len == 0 anyway, so we just have to check lsb == 0. But we need four additional instructions: - We have to use "adc msb,tmask" to set the msb correctly, so we'll fall back into the byte-load check after only one more digit, and - We have to "inc len" so it will decrement again properly. - And we need two instructions to test lsb == 0 I thought option 3 was smallest because I thought the "rol msb" the other cases required was the same size as "adc msb.tmask". I didn't notice the 2-instruction code sharing. So now it seems that option 2 is the smallest. Here it is in 34 instructions: .L_octhex: ldi mask, 7 lsr flags ; bit 1 in carry, but 1 in register breq 1f ldi mask, 15 ldi delta, 'A'-'0'+10 brcc 1f ldi delta, 'a'-'0'+10 1: ldi msb, 1 ldi tmask, 1 ld lsb, Z+ rjmp 2f .L_bitloop: ; Main loop ror lsb 2: lsr tmask ; Entry point, tmask >>= 1 brne 4f ; if (tmask == 0) { ; Check if we're done cp lsb, __zero_reg__ cpc len, __zero_reg__ breq .L_epilogue ; Spit out a digit of (lsb & mask) mov digit, lsb and digit, mask cpi digit, 10 brcs 3f add digit, delta 3: add digit, '0' st X+, digit mov tmask, mask 4: ; } lsr msb brne .L_bitloop ; End of main loop ; Load another byte ; Note: we know that msb == 0 and carry is SET dec len ; Load another byte breq 5f ld msb, Z+ 5: ror msb ; Shift carry=1 into msbit rjmp .L_bitloop .L_epilogue: movw r24, r26 ; Copy X to return value ret This is reasonably small (although there is definitely a size hit relative to the previous code, sigh) and reasonably efficient. As I said, the 2-instruction epilogue can be shared with the decimal code, and the prologue code might be shortened by 3 instructions by choosing the "flags" values so they can be used directly as "mask". That would cut it to 29. (Plus two in the common code to branch to this depending on the flags.) I have to code it up and test it, but I wanted to write down my thoughts about it first, and that seemed post-worthy.
_______________________________________________ AVR-libc-dev mailing list AVR-libc-dev@nongnu.org https://lists.nongnu.org/mailman/listinfo/avr-libc-dev