Georg-Johann Lay wrote:
> oh, I was just misreading this table and thought that it means yet
> another 200 bytes atop of ultoa_invert (just demonstrating that
> it isn't worse than ultoa_invert).
>
> But it appears you are intending to drop ultoa_invert which is great!

*Whew*!  No, I'm comparing it because it's a *replacement* for
ultoa_invert.o.  I wanted smaller *and* faster *and* arbitrary-length.
I was feeling very deflated by your complaints.

> Really funny results (just noticed you are using TABs and therefore
> the whole table is rigged ;-))

Here is the updated table (which consistently omits the string reverse)
in a form that can take one level of quoting:

Input          Decimal                      Hex
 bits  mem_toa mem_tod    itoa nibbles  mem_toa    itoa
    8      269     220     217     141      194      98
   16      664     462     451     321      527     187
   24     1294     783     838     608     1008     276
   32     2059    1219    1167     948     1637     365
   40     3059    1775    1649    1395     2414     454
   48     4194    2373    2127    1895     3339     543
   56     5477    3069    2733    2459     4412     632
   64     6995    3822    3420    3130     5633     721

For binary bases, you're seeing here the difference between an O(n)
algorithm (89*n + 9 cycles, with n the input length in bytes) and an
O(n^2) one (74*n^2 + 111*n + 9).  I'm really not happy with that.

For decimal, however, mem_tod without a multiplier is almost as fast as
my decimal code *with* one, and is 80 bytes long.  That's the code I'd
like to use on multiplierless machines.  mem_toa is 64 bytes, but the
almost 2x speed difference is worth the 16 bytes, IMHO.

Your u64toa_nibbles code (after I tweaked it a bit) is 90 bytes and the
fastest of all.  There, it's only about 75% the time of the
multiplierless code, so whether the speed is worth it is more of a
question.

One thing that justifies it to me is that the enhanced cores tend to
come with more flash, so an additional 10 bytes is affordable.  Also
contributing to the affordability is that the enhanced cores tend to
generate smaller code overall.

On the other hand, maintaining two completely different code paths is a
bother.  There's a lot to be said for just one.

One alternative I mentioned earlier, which I'm thinking seriously about,
is to reorganize the code into two phases:
1) Convert decimal and octal to little-endian BCD.  (%x would just find
   the length.)
2) Print little-endian hex as ASCII.

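To make the two-phase split concrete, here is a rough C sketch.  It is
not the actual avr-libc assembly and not how mem_tod works internally:
the names to_bcd_le, print_nibbles_le, fmt_decimal and fmt_hex are made
up for illustration, the 64-bit input limit is an assumption, and
dividing by 100 is just one simple way to get two digits per byte.
Octal is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Phase 1 (hypothetical): little-endian binary -> little-endian packed BCD.
 * Repeatedly divides bin[] by 100 in place (destroying it); each remainder
 * becomes one BCD byte, i.e. two decimal digits.  Returns the BCD length,
 * which is 0 when the input value is zero. */
static size_t to_bcd_le(uint8_t *bin, size_t len, uint8_t *bcd)
{
    size_t n = 0;

    while (len > 0 && bin[len - 1] == 0)    /* trim leading zero bytes */
        len--;
    while (len > 0) {
        unsigned rem = 0;
        for (size_t i = len; i-- > 0; ) {   /* long division, MSB first */
            unsigned t = (rem << 8) | bin[i];
            bin[i] = (uint8_t)(t / 100);
            rem = t % 100;
        }
        bcd[n++] = (uint8_t)(((rem / 10) << 4) | (rem % 10));
        while (len > 0 && bin[len - 1] == 0)
            len--;
    }
    return n;
}

/* Phase 2 (hypothetical): print little-endian nibbles as ASCII, most
 * significant digit first, suppressing leading zeros.  The same routine
 * serves hex input and the BCD produced by phase 1.  Returns the string
 * length (excluding the terminating NUL). */
static size_t print_nibbles_le(const uint8_t *buf, size_t n, char *out)
{
    static const char digit[] = "0123456789abcdef";
    size_t o = 0;

    while (n > 0 && buf[n - 1] == 0)        /* "find the length" */
        n--;
    if (n == 0) {
        out[o++] = '0';
    } else {
        if (buf[n - 1] >> 4)                /* skip a leading zero nibble */
            out[o++] = digit[buf[n - 1] >> 4];
        out[o++] = digit[buf[n - 1] & 0x0f];
        n--;
        while (n-- > 0) {
            out[o++] = digit[buf[n] >> 4];
            out[o++] = digit[buf[n] & 0x0f];
        }
    }
    out[o] = '\0';
    return o;
}

/* %u path: phase 1 then phase 2.  Assumes at most 64 bits of input,
 * which needs at most 20 digits = 10 BCD bytes. */
static size_t fmt_decimal(const uint8_t *val, size_t len, char *out)
{
    uint8_t bin[8], bcd[10];

    assert(len <= sizeof bin);
    memcpy(bin, val, len);
    return print_nibbles_le(bcd, to_bcd_le(bin, len, bcd), out);
}

/* %x path: phase 1 is a no-op, phase 2 finds the length itself. */
static size_t fmt_hex(const uint8_t *val, size_t len, char *out)
{
    return print_nibbles_le(val, len, out);
}
```

With the little-endian bytes 78 56 34 12 as input, fmt_hex produces
"12345678" and fmt_decimal produces "305419896".  The point of the
split is that the only per-base code is phase 1; everything that
touches ASCII is shared.
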
That would enlarge the code somewhat, but reduce stack usage by 11
bytes.  As I noted, ROM:RAM is usually 16:1, so I could argue that those
11 bytes of RAM are "worth" 176 bytes of code, which is far more than
the code size increase.

Note that mem_tod produces digits two at a time anyway, so it's a
natural fit.  I have some ideas for how to adapt the u64toa_nibbles
code, and octal wouldn't be too hard.

I'd really appreciate your opinion of the idea.

> Often programmers are bitten by their smartness when they observe
> that avr-gcc generates "some" extra instructions around the core
> 64-bit arithmetic.  But that's a different story...

I don't quite know what you're alluding to.  What frustrates me about
avr-gcc is code like time/gm_sidereal.c, where it's doing a 32x32->64
bit multiply, but if !__AVR_HAVE_MUL__, then gcc "knows" that there's no
__umulsidi3 function, and generates an absolutely massive spill & fill
sequence to call __muldi3, which ends up being a lot bigger & slower
than a call to the __umulsidi3 wrapper which totally exists.

>> That saves 90 cycles, taking it to 7206.
>
> Just a small speed-up, but really cool idea :-)

You had some awfully cool ideas yourself, particularly integrating the
length-finding into the main loop.

_______________________________________________
AVR-libc-dev mailing list
AVR-libc-dev@nongnu.org
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev