[avr-libc-dev] Printing octal (brain dump)

George Spelvin Tue, 13 Dec 2016 22:52:58 -0800

If anyone wants to flex their asm-hacking skills, there's some draft
code here.  If not, don't panic; this is primarily notes to myself
that someone else might be interested in.


To finish long long support in printf, I need to print 64-bit numbers
(preferably, arbitrary-precision numbers) in octal, and it seems to
be taking as long to write as the decimal code.

Hex is easy to do byte at a time, but octal is messy because 3 doesn't
divide 8.  It's not that it's difficult, per se.  Or that I need to
stress about efficiency (to a first approximation, nobody uses octal,
so it matters even less than usual).

The problem is that messy translates to large code (boo!), and
it's hard to see how large an alternative will be without writing
it in detail.

The old code just masked and printed the low-order bits, did a full-width
shift by 1 bit, repeated it 3x, and continued as long as the result
wasn't zero.  It did that with a 12-instruction helper routine to shift
the buffer by cnt bits:

.L_lsr_4:
        ldi     cnt, 4
.L_lsr:
        lsr     v_fifth
        ror     v_hhi
        ror     v_hlo
        ror     v_hi
        ror     v_lo
        dec     cnt
        brne    .L_lsr
  ; tst
        sbiw    v_hlo, 0        ; only Z flag is needed
        cpc     v_lo, rzero
        cpc     v_hi, rzero
        ret

This was shared between the hex/octal print code and the decimal print
code, so the rest of the hex/octal print code was only 20 instructions
(+4 to branch to it depending on "base" on function entry).

The equivalent for a variable-length memory buffer is rather more awkward,
14 instructions which *aren't* shared with the decimal print code:

.L_lsr_4:
        ldi     cnt, 4
.L_lsr:
        movw    ZL, r24
        sub     merge, merge    ; Set to to zero and clear carry
        mov     tlen, len
1:      ld      r0, -Z
        ror     r0
        st      Z, r0
        or      merge, r0
        dec     tlen
        bne     1b
        dec     cnt
        bne     .L_lsr
        tst     merge
        ret

Worse, the inner loop is 9 cycles to shift 1 byte by 1 bit, and the outer
loop is another 5 per bit.  Shifting an 8-byte value is 9 * 8 + 5 = 77
cycles per bit.

If I have a 64-bit input to print, then that's 77 * 64 = 4928 cycles
just spent shifting!  Given that I can print a *decimal* 64-bit number
in 4103 cycles, that seems Just Wrong.

And even assuming I could keep the rest to 20 instructions, that's a
total of 34.  (It's not obvious I can, because while the shift leaves the
lsbyte in r0 ready for printing, the first digit doesn't have a shift,
so it needs to be handled specially.)


After a lot of messing about, I came up with the following O(n) scheme.
It buffers two registers, keeping 8..15 bits of the input buffered at any
given time.  Shifting is bit at a time, with the buffer refilled every 8
bits, and a digit printed every 3.  (Or 4.  The code is not only shared
with hex, but can handle bases 2, 4 or 32 as well.)

A trick I use repeatedly is bit shifting for loop counters.  Rather than
initialize a counter to "3" and decrement it each iteration, I initialize
a counter to "7" (a value I already have available for digit-extraction
purposes) and shift it right each iteration.

For the bit buffer, I have 8..15 bits of buffered data, and rather than
have a separate counter, mark the end of the valid data with a 1 bit.
So the data starts out like "1abcdefg hijklmno" and shifts down until
it's "00000001 abcdefgh".  At that point, we start the next shift,
but the msbyte becomes zero (with the 1 bit stored in the carry).
That's the signal to fetch another byte.

This reduces the loop overhead in both time and code size.


Here's the current draft.  Suggestions solicited!
Registers:
        len: Length of input (in bytes).  Shared code with
                the decimal print has stripped off leading zero
                bytes.
        X: Points to output buffer
        Z: Points to input buffer
        flags: Input flags distinguishing decimal, octal,
                and upper/lower case hex
        msb, lsb: Current bit buffer.  Most significant set bit of msb
                counts bits per byte.
        mask: A bitmask the size of the current base's digits (7 or 15)
        tmask: A loop counter initialized to "mask" which counts bits per digit
        digit: a temporary used to form the ASCII digit
        delta: The amount to add to digits past '9' to make
                them start at 'a' or 'A'.

.L_octhex:
        ldi     mask, 7
        lsr     flags           ; bit 1 in carry, but 1 in register
        breq    1f
        ldi     mask, 15
        ldi     delta, 'A'-'0'+10
        brcc    1f
         ldi    delta, 'a'-'0'+10
1:
        ldi     msb, 1
        ldi     tmask, 1
        ld      lsb, Z+
        rjmp    2f

.L_bitloop:                     ; Main loop
        ror     lsb
2:      lsr     tmask           ; Entry point, tmask >>= 1
        brne    4f              ; if (tmask == 0) {
        ; Spit out a digit of (lsb & mask)
        mov     digit, lsb
        and     digit, mask
        cpi     digit, 10
        brcs    3f
         add    digit, delta
3:      add     digit, '0'
        st      X+, digit
        mov     tmask, mask
4:                              ; }
        lsr     msb
        brne    .L_bitloop      ; End of main loop

        ; Load another byte
        ; Note: we know that carry is SET
        dec     len             ; Load another byte
        breq    .L_lastbyte
        ld      msb, Z+
        ror     msb             ; Shift carry=1 into msbit
        rjmp    .L_bitloop

.L_lastbyte:
lastbyte:
        adc     msb, tmask      ; Pad until end of digit (and clear carry)
        inc     len
        cpse    lsb, __zero_reg__
         rjmp   4b
        movw    r24, r26        ; Copy X to return value
        ret

This is 35 instructions in its current state.  I can probably shave
one off the end by sharing the epilogue with the decimal print code,
and I'm thinking of making the "mask" value be what's passed in from
the caller to select the base, which will simplify the prologue.

The main loop has two conditional branches, either of which could be
the loop end.  I could save a cycle (but cost an instruction) in the
main loop by placing both conditions out of line.

The other tricky thing is the "lastbyte" section, invoked when we need
another byte but len == 0.  When we get to the end of the input, we need
to pad with zeros until the significant digits have all been printed.
This involves setting msb to some suitable all-zero, and keeping going
until lsb == 0 as well.

We stop when len == 0 and lsb == 0.  len is decremented *after* using
the byte, so when len is decremented to 0, the last byte has already
been used and we need to load an msbyte of padding.

There are three ways to check for this, and I *think* I've picked the one
with minimum code size (and fastest among those), but let me analyze them
in detail.

In the first two cases, we can pad with a full byte of 8 zero bits,
then the "lastbyte" code is simply "ror msb" (which sets msb to 0x80
and clears the carry in one instruction, since we know we started with
msb = 0 and carry = 1) and "rjmp .L_bitloop".  Oh!  Those instructions
already exist at the end of the "load another byte" branch, so maybe
the savings is more than I thought.

The three ways of checking and the additional costs are:

1) Check each iteration of the main bit loop.  This is slowest, so only
   use it if it has no rivals for smallest.  This can be placed after
   the "ror lsb", adding three more instructions: branch if not equal,
   test len == 0, and branch if equal to epilogue code.
2) Check with each digit printed.  This is identical code to the above,
   but inside the "if (!tmask)" condition to reduce overhead.
   The problem is that we don't get the test of lsb for free any more,
   so I think it's four instructions.  No!  We can do it in three:
        cp lsb, __zero_reg__ $ cpc len,__zero_reg__ $ breq epilogue
   CPC ANDs its result with the previous zero flag.  *If* the zero flag
   was set (the only case we care about) then the carry flag is clear,
   and the cpc will compute the test we want.  If the carry is set,
   the cpc will do the wrong thing, but that doesn't matter because
   zero is already clear.
3) The way I did it, checking with each byte fetched.  We have to
   test len == 0 anyway, so we just have to check lsb == 0.
   But we need four additional instructions:
   - We have to use "adc msb,tmask" to set the msb correctly, so we'll
     fall back into the byte-load check after only one more digit, and
   - We have to "inc len" so it will decrement again properly.
   - And we need two instructions to test lsb == 0

I thought option 3 was smallest because I thought the "rol msb" the
other cases required was the same size as "adc msb.tmask".  I didn't
notice the 2-instruction code sharing.

So now it seems that option 2 is the smallest.  Here it is in 34
instructions:
  
.L_octhex:
        ldi     mask, 7
        lsr     flags           ; bit 1 in carry, but 1 in register
        breq    1f
        ldi     mask, 15
        ldi     delta, 'A'-'0'+10
        brcc    1f
         ldi    delta, 'a'-'0'+10
1:
        ldi     msb, 1
        ldi     tmask, 1
        ld      lsb, Z+
        rjmp    2f

.L_bitloop:                     ; Main loop
        ror     lsb
2:      lsr     tmask           ; Entry point, tmask >>= 1
        brne    4f              ; if (tmask == 0) {
        ; Check if we're done
        cp      lsb, __zero_reg__
        cpc     len, __zero_reg__
        breq    .L_epilogue
        ; Spit out a digit of (lsb & mask)
        mov     digit, lsb
        and     digit, mask
        cpi     digit, 10
        brcs    3f
         add    digit, delta
3:      add     digit, '0'
        st      X+, digit
        mov     tmask, mask
4:                              ; }
        lsr     msb
        brne    .L_bitloop      ; End of main loop

        ; Load another byte
        ; Note: we know that msb == 0 and carry is SET
        dec     len             ; Load another byte
        breq    5f
        ld      msb, Z+
5:      ror     msb             ; Shift carry=1 into msbit
        rjmp    .L_bitloop

.L_epilogue:
        movw    r24, r26        ; Copy X to return value
        ret

This is reasonably small (although there is definitely a size hit
relative to the previous code, sigh) and reasonably efficient.  As I said,
the 2-instruction epilogue can be shared with the decimal code, and the
prologue code might be shortened by 3 instructions by choosing the "flags"
values so they can be used directly as "mask".  That would cut it to 29.
(Plus two in the common code to branch to this depending on the flags.)

I have to code it up and test it, but I wanted to write down my thoughts
about it first, and that seemed post-worthy.

_______________________________________________
AVR-libc-dev mailing list
AVR-libc-dev@nongnu.org
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev

[avr-libc-dev] Printing octal (brain dump)

Reply via email to