http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51441

             Bug #: 51441
           Summary: Incorrect FP rounding on addition of doubles
    Classification: Unclassified
           Product: gcc
           Version: 4.4.3
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: exponent-b...@yandex.ru


A well-known sentence of IEEE 754 reads: "Each of the computational operations
that return a numeric result specified by this standard shall be performed as
if it first produced an intermediate result correct to infinite precision and
with unbounded range, and then rounded that intermediate result, if necessary,
to fit in the destination's format..."

Given two operands in IEEE's binary32 format:
    0x1.002p0  (C99 hexadecimal notation)
and
    1.6331239353195372e+16
I believe the result of addition of the operands under the to-nearest rounding
mode shall be:
    1.6331239353195374e+16
whereas GCC yields:
    1.6331239353195372e+16
(note the difference in the last sign of the mantissa).

The analysis is this: the difference of exponents of the operands is 53 bits
which is also the width of mantissa in the binary32 format, including the
implied most significant raised bit. Thus, mantissa of the intermediate result
of the addition should be result of concatenation of mantissas of the operands:
mantissa of 1.6331239353195372e+16 followed by mantissa of 0x1.002p0. That is,
the most significant mantissa's bit of 0x1.002p0 is the first bit that shall be
rounded away to fit the destination's format. We know that first bit is raised
and we know there is one more bit raised in the mantissa, so under the
to-nearest rounding mode the intermediate result shall be rounded to the next
representable number that is not less than the intermediate result, which is
1.6331239353195374e+16.

The test code:
----
#include <stdlib.h>
#include <stdio.h>

double a, b, r;

int main(void)
{
    a = 0x1.002p0;
    b = 1.6331239353195372e+16;

    r = a + b;

    printf("%+.16e + %+.16e = %+.16e\n", a, b, r);

    return EXIT_SUCCESS;
}
----

Compilation flags:
gcc --float-store -g -O0 -o test test.c

Actual output:
+1.0004882812500000e+00 + +1.6331239353195372e+16 = +1.6331239353195372e+16

Expected output:
+1.0004882812500000e+00 + +1.6331239353195372e+16 = +1.6331239353195374e+16

Compiler version:
gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3

Reply via email to