Re: [Rd] bug in sum() on integer vector

Hervé Pagès Wed, 14 Dec 2011 17:51:51 -0800

Hi Peter,

On 11-12-14 08:19 AM, peter dalgaard wrote:


On Dec 14, 2011, at 16:19 , John C Nash wrote:


Following this thread, I wondered why nobody tried cumsum to see where the
integer overflow occurs. On the shorter xx vector in the little script below
I get a message:

Warning message:
Integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'


But sum() does not give such a warning, which I believe is the point of
contention. Since cumsum() does manage to give such a warning, and show where
the overflow occurs, should sum() not be able to do so? For the record, I
don't class the non-zero answer as an error in itself. I regard the failure
to warn as the issue.


It (sum) does warn if you take the two "halves" separately. The issue is that
the overflow is detected at the end of the summation, when the result is to be
saved to an integer (which of course happens for all intermediate sums in
cumsum).

x <- c(rep(1800000003L, 10000000), -rep(1200000002L, 15000000))
sum(x[1:10000000])

[1] NA
Warning message:
In sum(x[1:1e+07]) : Integer overflow - use sum(as.numeric(.))

sum(x[10000001:25000000])

[1] NA
Warning message:
In sum(x[10000001:1.5e+07]) : Integer overflow - use sum(as.numeric(.))

sum(x)

[1] 4996000

There's a pretty easy fix, essentially to move

     if (s > INT_MAX || s < R_INT_MIN) {
         warningcall(call, _("Integer overflow - use sum(as.numeric(.))"));
         *value = NA_INTEGER;
     }

inside the summation loop. Obviously, there's a speed penalty from two FP
comparisons per element, but I wouldn't know whether it matters in practice
for anyone.


Since you want to generate this warning once only, your test (now
inside the loop) needs to be something like:

    if (warn && (s > INT_MAX || s < R_INT_MIN)) {
        /* generate the warning, e.g. with the warningcall() quoted above */
        warn = 0;  /* so the warning is issued only once */
    }

with 'warn' initialized to 1. This makes the isum() function almost
twice as slow on my machine (64-bit Ubuntu) when compiling with
gcc -O2 and when no overflow occurs (the most common use case, I guess).
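
To make the warn-once check concrete, here is a minimal standalone sketch.
It is not the actual R source: isum_checked() and SKETCH_INT_MIN are
hypothetical names, the warning is reduced to a return code so the example
compiles outside R, and SKETCH_INT_MIN assumes that R_INT_MIN sits one above
INT_MIN so that INT_MIN stays free for NA_INTEGER.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumption: mirrors R_INT_MIN, which keeps INT_MIN free for NA_INTEGER. */
#define SKETCH_INT_MIN (INT_MIN + 1)

/*
 * Hypothetical stand-in for isum(): sums n ints in a double and checks the
 * running total on every iteration. Returns 1 if any partial sum left the
 * int range, 0 otherwise. *value is written only when no overflow was seen.
 */
int isum_checked(const int *x, size_t n, int *value)
{
    double s = 0.0;
    int warn = 1;  /* cleared at the first overflow, so it is reported once */

    assert(value != NULL && (x != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        s += x[i];
        if (warn && (s > INT_MAX || s < SKETCH_INT_MIN)) {
            /* inside R, this is where the warning would be generated */
            warn = 0;
        }
    }
    if (warn)
        *value = (int) s;
    return !warn;
}
```

With the input {INT_MAX, 1, -5} the final total fits in an int again, so a
check made only after the loop would stay silent, while this version reports
the overflow at the second element.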

Why not just do the sum in a long double instead of a double?
It slows down isum() by only 8% on my machine when compiling
with gcc -O2.
But most importantly this solution also has the advantage of making
sum(x) consistent with sum(as.double(x)). The latter uses rsum() which
does the sum in a long double. So by using a long double in both isum()
and rsum(), consistency between sum(x) and sum(as.double(x)) is
guaranteed.
Maybe that still doesn't give you the guarantee that sum(x) will always
return the correct value (when it does not return NA) because that
depends now on the ability of long double to represent exactly the sum
of at most INT_MAX arbitrary ints. The number of bits used for long double
seems to vary a lot across platforms/compilers so it's hard to tell.
Not an ideal solution, but at least it makes isum() more accurate than
the current isum() and it makes sum(x) consistent with sum(as.double(x))
on all platforms, without degrading performance too much.
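
A rough way to put numbers on this: at most INT_MAX ints, each of magnitude
at most INT_MAX, give partial sums below 2^62 in magnitude, so a binary long
double with at least 62 mantissa bits holds every partial sum exactly. A
53-bit double is exact only up to 2^53 (about 9.0e15), which Peter's example
exceeds with a partial sum of about 1.8e16. The sketch below is only an
illustration under those assumptions: the function names are hypothetical,
not the R source, and the check relies on LDBL_MANT_DIG from <float.h>
describing a conventional binary format.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Accumulate in a double, as the thread describes the current isum(). */
double isum_double(const int *x, size_t n)
{
    double s = 0.0;

    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Accumulate in a long double, as proposed for isum() in this message. */
long double isum_long_double(const int *x, size_t n)
{
    long double s = 0.0L;

    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/*
 * Nonzero when long double has enough mantissa bits to represent every
 * integer below 2^62 in magnitude, which covers any partial sum of at most
 * INT_MAX ints. Assumes a conventional radix-2 floating-point format.
 */
int long_double_sum_is_exact(void)
{
    return FLT_RADIX == 2 && LDBL_MANT_DIG >= 62;
}
```

Where long_double_sum_is_exact() returns 0 (for instance when long double is
the same type as double), isum_long_double() is no more accurate than
isum_double(), which matches the caveat above.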

Cheers,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
