Re: [R] inconsistent behavior of summary function

Brian Diggs Tue, 04 Oct 2011 14:20:29 -0700

I'm going to put on my fire suit and wade in (see inline)


On 10/4/2011 8:11 AM, Bert Gunter wrote:

On Tue, Oct 4, 2011 at 7:42 AM, Jeanne M. Spicer <xn8spi...@gmail.com> wrote:

I'm not sure how returning an incorrect result is ever a 'positive' feature


It is **not** "incorrect"; perhaps unexpected, but that is not the same.


"You are technically correct -- the best kind of correct" -- Futurama

The results (using the built-in data set rock)

> summary(rock["area"])
      area
 Min.   : 1016
 1st Qu.: 5305
 Median : 7487
 Mean   : 7188
 3rd Qu.: 8870
 Max.   :12212
> summary(rock[["area"]])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1016    5305    7487    7188    8870   12210

differ for exactly the reason you say (dispatching to different methods of summary), and the different values of max are both correct given the documentation. However, let's walk through what it takes to show that.

In the help page for summary, an option digits is described, which has the default value max(3, getOption("digits")-3). Executing this (or getOption("digits") alone and doing the math) results in the default value of digits being 4 (at least for me; and I do not believe that I have changed the option).
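
As a quick check of that arithmetic, the default can be recomputed at the console. This is only a sketch: it sets the digits option to 7 (R's factory default) explicitly, which is an assumption about the session, and default_digits is just an illustrative name.

```r
# Assumption: the session uses R's factory default of 7 printed digits.
old <- options(digits = 7)

# The default that the summary help page gives for its digits argument.
default_digits <- max(3, getOption("digits") - 3)
print(default_digits)

options(old)  # restore whatever was set before
```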

So what is this option used for? In the documentation, it says: "integer, used for number formatting with signif() (for summary.default) or format() (for summary.data.frame)." Let's assume that we realize that rock["area"] is a data frame, which would be handled by summary.data.frame, and rock[["area"]] is a vector, and further determine that summary.default is what will handle it (having not found summary.vector or summary.integer).
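
One way to do that search without guessing is to ask R which summary methods are registered. The sketch below uses getS3method() from utils; the variable names are illustrative, and the result for "integer" assumes a session with only the standard packages loaded (an add-on package could in principle register its own method).

```r
# getS3method(..., optional = TRUE) returns NULL when no such method exists.
has_integer_method <- !is.null(getS3method("summary", "integer", optional = TRUE))
has_df_method      <- !is.null(getS3method("summary", "data.frame", optional = TRUE))
has_default_method <- !is.null(getS3method("summary", "default", optional = TRUE))

# With no summary.integer available, an integer vector falls through
# to summary.default.
print(c(integer = has_integer_method,
        data.frame = has_df_method,
        default = has_default_method))
```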

Let's dive into the help pages for signif and format, since they are listed as relevant to the use of digits in the two different cases.

signif tells us that digits is "integer indicating the number of ... significant digits (signif) to be used." Looking at "Details", the last sentence says "Each element of the vector is rounded individually, unlike printing." So in the case of a vector, each value is separately rounded to 4 significant digits (max of 12212 is rounded to 12210).
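
The element-by-element rounding can be seen by calling signif directly on a few statistics of the column. This is a sketch of the rounding step only, not of what summary.default does internally; stats and rounded are illustrative names.

```r
x <- rock[["area"]]

# A few of the statistics that summary() reports, computed by hand.
stats <- c(Min = min(x), Median = median(x), Mean = mean(x), Max = max(x))

# signif() rounds every element to 4 significant digits on its own,
# so a five-digit value loses its last digit.
rounded <- signif(stats, digits = 4)
print(rounded)
```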

format tells us that digits is "how many significant digits are to be used for numeric and complex x. ... This is a suggestion: enough decimal places will be used so that the smallest (in magnitude) number has this many significant digits, and also to satisfy nsmall."

So the difference is that if it is a vector, each part (min, quartiles, mean, and max) is rounded to 4 significant digits individually, while if it is a column of a data frame, the set is collectively rounded so that the smallest has 4 significant digits and the rest are carried out to the same decimal place.
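
The contrast can be reproduced on just the minimum and maximum shown earlier. This is a minimal sketch that calls signif and format directly rather than going through summary; vals, by_signif and by_format are illustrative names.

```r
vals <- c(1016, 12212)   # min and max of rock$area, as reported above

# signif(): each element is rounded to 4 significant digits by itself.
by_signif <- signif(vals, digits = 4)

# format(): one common layout is chosen for the whole vector, so the
# larger value keeps all of its integer digits.
by_format <- format(vals, digits = 4)

print(by_signif)
print(by_format)
```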


Some points:

1) Both of these functions are in base, so I would expect the same behavior using the same (default) arguments. Yes, the key word is "expect." Hopefully I have demonstrated that I understand why they differ. I would not anticipate rounding, and when only one value has only one digit rounded, it is not really obvious that it happened. (As compared to, say, summary(11111*rock$area), if I knew the data was not all rounded to the nearest 10,000.) So this is not just a matter of realizing that different methods are being dispatched, but reading through three different help pages (at least three, assuming I started at the right place and realized which other two were the relevant ones) to see that the end results are presented differently, WHICH I WOULD NOT REALIZE THAT I EVEN NEED TO DO.

2) rock$area is an integer vector, so even if I realize that rounding would be done on floating point numbers, I would not expect (yes, again, "expect") that integers would need to be rounded to some lesser number of significant digits.

3) The documentation for summary is actually wrong about digits for the case of summary.data.frame. Consider:

> summary(rock["area"], digits=17)
      area
 Min.   : 1016.0000000000000
 1st Qu.: 5305.2500000000000
 Median : 7487.0000000000000
 Mean   : 7187.7291666700003
 3rd Qu.: 8869.5000000000000
 Max.   :12212.0000000000000

In particular, note the mean. It is wrong (mathematically incorrect AND not consistent with the documentation).

> dput(mean(rock["area"]))
structure(7187.72916666667, .Names = "area")

Why? Internally, summary.data.frame calls summary.default on rock[["area"]] with a hard coded digits value of 12. It then takes this value and formats it with 17 digits of precision as requested. That's why there are the four zeros in the middle (the last digit being numerical imprecision due to binary representation of floating point values).
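
The two steps described above can be imitated by hand. This is only a sketch of the mechanism as explained in this message (round to 12 significant digits, then format with 17); it does not call or reproduce the actual source of summary.data.frame, and other R versions may behave differently. The names m, m12 and err are illustrative.

```r
m <- mean(rock[["area"]])

# Step 1 (as described above): round to 12 significant digits.
m12 <- signif(m, digits = 12)

# Step 2: format the already-rounded value with 17 digits, and the
# unrounded mean for comparison.
print(format(m12, digits = 17))
print(format(m, digits = 17))

# The error introduced by step 1; extra digits requested in step 2
# cannot recover it.
err <- abs(m12 - m)
print(err)
```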

4) summary.default does not necessarily honor the number of significant digits either:


> for(i in 1:9) print(summary(rock[["area"]], digits=i))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1000    5000    7000    7000    9000   10000
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1000    5300    7500    7200    8900   12000
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1020    5310    7490    7190    8870   12200
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1016    5305    7487    7188    8870   12210
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 1016.0  5305.2  7487.0  7187.7  8869.5 12212.0
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 1016.00  5305.25  7487.00  7187.73  8869.50 12212.00
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 1016.000  5305.250  7487.000  7187.729  8869.500 12212.000
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 1016.000  5305.250  7487.000  7187.729  8869.500 12212.000
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 1016.000  5305.250  7487.000  7187.729  8869.500 12212.000

Beyond 7, no additional significant digits are printed, despite the value of digits. This is the behavior of signif

> signif(mean(rock[["area"]]), digits=9)
[1] 7187.729

but is not consistent with documentation (which says digits can be as large as 22).
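
One way to probe where the seven-digit cap comes from is to separate the rounding from the printing. The sketch below is my own check, not something the help pages quoted here state: it asks print for more digits than the default of getOption("digits"), and compares the 9-digit and 7-digit roundings numerically. If they differ, signif did retain the extra digits and the cap would seem to come from the print step instead.

```r
m  <- mean(rock[["area"]])
s9 <- signif(m, digits = 9)
s7 <- signif(m, digits = 7)

print(s9)                # printed with getOption("digits") significant digits
print(s9, digits = 15)   # same value, but print is asked for more digits

# TRUE if the eighth and ninth significant digits survived in s9.
differs <- s9 != s7
print(differs)
```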

but at least the documentation could more clearly warn users that this
method behaves differently in these cases -- summary(rock[,1]) vs
summary(rock[,1:2]) -- and that the method can and *does* return incorrect
results without any warning messages.


What is (in)adequate in documentation is often in the mind of the beholder.

Note:

class(rock[,1])

[1] "integer"

class(rock[,1:2])

[1] "data.frame"

This means that different methods are dispatched, leading to the different
results. Moreover,

summary(rock[,1,drop=FALSE])

       area
  Min.   : 1016
  1st Qu.: 5305
  Median : 7487
  Mean   : 7188
  3rd Qu.: 8870
  Max.   :12212

... and that is because

class(rock[,1,drop=FALSE])

[1] "data.frame"

So the relevant Help file is ?"[.data.frame"
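
The subsetting forms used in this thread can be checked side by side. This is a small sketch on the built-in rock data; keeps_df and drops_df are illustrative names.

```r
# Forms that keep the data frame structure.
class(rock["area"])
class(rock[, 1, drop = FALSE])
class(rock[, 1:2])

# Forms that drop to the bare column.
class(rock[["area"]])
class(rock[, 1])

keeps_df <- is.data.frame(rock["area"]) && is.data.frame(rock[, 1, drop = FALSE])
drops_df <- !is.data.frame(rock[["area"]]) && !is.data.frame(rock[, 1])
print(c(keeps_df = keeps_df, drops_df = drops_df))
```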

That certainly explains the reasoning for the different dispatches, but is only the start of understanding what is going on. The data frame method does rather what you would expect (since format tends to be less surprising from an output point of view). Consider another example:


> summary(11111*rock["area"])
      area
 Min.   : 11288776
 1st Qu.: 58946633
 Median : 83188057
 Mean   : 79862859
 3rd Qu.: 98549015
 Max.   :135687532
> summary(11111*rock[["area"]])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 11290000  58950000  83190000  79860000  98550000 135700000

Both of these have a digits value of 4 (the default), but the data frame one "ignores" it (or, more accurately, format takes it as a recommendation but prints all values down to the 1's place despite only 4 significant digits being requested, probably due to nsmall being 0). The default method dutifully rounds each value to the requested default 4 significant digits.
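
The same contrast shows up when signif and format are applied directly to the scaled minimum and maximum. This is a sketch of the two formatting routes only, not of the summary methods themselves; scaled, by_signif and by_format are illustrative names.

```r
scaled <- 11111 * c(1016, 12212)   # min and max of rock$area, times 11111

# Rounded element by element to 4 significant digits.
by_signif <- signif(scaled, digits = 4)

# Formatted as a set; digits = 4 is treated as a suggestion and the
# integer digits are kept.
by_format <- format(scaled, digits = 4)

print(by_signif)
print(by_format)
```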

I would encourage anyone teaching introductory R to look at the 'epicalc'
package.  The re-vamped function 'summ' in that package returns correct
results regardless - summ(rock), summ(rock$area).  In addition, when you
only ask for one column you not only get the correct results, you also get a
bonus distribution plot.

I would like all of our students to use R, but little things like this
are huge stumbling blocks for them.


I have no doubt that this is true. R is powerful, flexible and, as an
inevitable result, complex. To master it, honest effort is required,
probably a somewhat scarce commodity in introductory classes, especially for
non-statisticians. For that reason, there are numerous learning resources
available, to be found on CRAN. Have you looked at them? Moreover, there are
several R GUIs that attempt to shield the beginner from the initial shock,
to be found in the R-GUIs link under "Other Projects." Have you considered
those?

So I think something more than righteous indignation is called for here.
Nevertheless, the bottom line is that you get what you pay for: R **IS**
hard -- but for many serious data analysts of all stripes, worth the effort.

I saw it as more exasperation at inconsistencies than righteous indignation. There is much power in R, and there are many subtle points (to which the existence of the R Inferno attests). Certainly the more complicated the task undertaken, the more subtleties are to be expected. But to have to track subtle rounding issues for a simple summary of a set of numbers (depending on how exactly the summary is requested) was where I thought the frustration was coming from.

Cheers,
Bert

-jeanne


--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
