http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51441
Bug #: 51441 Summary: Incorrect FP rounding on addition of doubles Classification: Unclassified Product: gcc Version: 4.4.3 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c AssignedTo: unassig...@gcc.gnu.org ReportedBy: exponent-b...@yandex.ru A well-known sentence of IEEE 754 reads: "Each of the computational operations that return a numeric result specified by this standard shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result, if necessary, to fit in the destination's format..." Given two operands in IEEE's binary32 format: 0x1.002p0 (C99 hexadecimal notation) and 1.6331239353195372e+16 I believe the result of addition of the operands under the to-nearest rounding mode shall be: 1.6331239353195374e+16 whereas GCC yields: 1.6331239353195372e+16 (note the difference in the last sign of the mantissa). The analysis is this: the difference of exponents of the operands is 53 bits which is also the width of mantissa in the binary32 format, including the implied most significant raised bit. Thus, mantissa of the intermediate result of the addition should be result of concatenation of mantissas of the operands: mantissa of 1.6331239353195372e+16 followed by mantissa of 0x1.002p0. That is, the most significant mantissa's bit of 0x1.002p0 is the first bit that shall be rounded away to fit the destination's format. We know that first bit is raised and we know there is one more bit raised in the mantissa, so under the to-nearest rounding mode the intermediate result shall be rounded to the next representable number that is not less than the intermediate result, which is 1.6331239353195374e+16. The test code: ---- #include <stdlib.h> #include <stdio.h> double a, b, r; int main(void) { a = 0x1.002p0; b = 1.6331239353195372e+16; r = a + b; printf("%+.16e + %+.16e = %+.16e\n", a, b, r); return EXIT_SUCCESS; } ---- Compilation flags: gcc --float-store -g -O0 -o test test.c Actual output: +1.0004882812500000e+00 + +1.6331239353195372e+16 = +1.6331239353195372e+16 Expected output: +1.0004882812500000e+00 + +1.6331239353195372e+16 = +1.6331239353195374e+16 Compiler version: gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3