Some day ago I tried to adapt Sean's fixed point patch [1] resp. the abandoned attempt [2] to avr-gcc 4.8. A current version thereof can be seen in [3], but there are still several issues.
One issue is the precision of multiplication. A round to nearest would come up with a rounding error of up to 0.5 LSBs compared to the correct result. A round up / round down / round to zero would come up with an error less than 1 LSB. ISO/IEC TR18037 allows a rounding error up to 2 LSBs. The current implementation of [U]SAmode ([Un]signed Single Accum) has an error of more than 2 LSBs. The reason is that the least significant bytes are not taken into account in the arithmetic and thus the respective carries are lost. One example is 0xff.01ff * 0xff.01ff where the product if off by 0x2.fc LSBs which is very close to 3 LSBs. This leads to a premature accumulation of computation errors and poor conditioned arithmetic. There are basically 2 ways to perform an operation: 1) Use all parts of the involved operands and adjust the result to yield the desired format. 2) Only perform sub-computations whose result will overlap the destination. Carries from LSBs will be lost. Approach 1) leads to the exact result so that the algorithms can be sure to round the /exact/ result to get the rounded result (as required by TR18037 for instance). Approach 2 is faster but does not offer control over rounding errors and cannot be used if a saturated result is needed. The above rounding problem can be fixed by a 64 = 32 * 32 widening multiplication, but the code would run slower and need more flash. The current implementation in lib1funcs-fixed.S: __mulusa3: 104B __mulsa3 : 108B A widening multiply will need about 15 instructions more: __umulsidi3: 74B __umulhisi3: 30B __muldi3_6: 18B The latter two routines can be shared with other code: __umulhisi3 is likely shared, __muldi3_6 unlikely. If there will ever be an optimized version of the saturated counterparts, then they will also use __umulsidi3 et al. What is the best approach here? Code that is slower, might consume more flash and stack but complies to TR18037? Or fast code whose rounding behavior if not withing the 2 LSBs as of TR18037? A code that uses both signed and unsigned versions will consume ~200 bytes for the multiplications alone. Sign extension can be performed in three different ways: 1) Explicit before the computation 2) Implicit during the computation 3) Explicit after the computation [3] currently uses 2) but could reuse the unsigned version and then consumes 22 bytes by means of 3) like so: DEFUN __mulsa3 XCALL __mulusa3 tst B3 brpl 1f sub C2, A0 sbc C3, A1 1: sbrs A3, 7 ret sub C2, B0 sbc C3, B1 ret ENDF __mulsa3 Thus, if both the signed and the unsigned versions are needed, the code size will go down by more than 80 bytes. If only the signed version is used, code size goes up by 20 bytes. What's the best here? Johann -- [1] http://distribute.atmel.no/tools/opensource/avr-gcc/gcc-4.5.1/30-gcc-4.5.1-fixedpoint-3-4-2010.patch [2] http://lists.gnu.org/archive/html/avr-gcc-list/2012-07/msg00030.html [3] http://gcc.gnu.org/ml/gcc-patches/2012-08/msg00632.html _______________________________________________ AVR-GCC-list mailing list AVR-GCC-list@nongnu.org https://lists.nongnu.org/mailman/listinfo/avr-gcc-list