I'm going to put on my fire suit and wade in (see inline)

On 10/4/2011 8:11 AM, Bert Gunter wrote:
On Tue, Oct 4, 2011 at 7:42 AM, Jeanne M. Spicer <xn8spi...@gmail.com> wrote:

I'm not sure how returning an incorrect result is ever a 'positive' feature

It is **not** "incorrect"; perhaps unexpected, but that is not the same.


"You are technically correct -- the best kind of correct" -- Futurama

The results (using the built-in data set rock)

> summary(rock["area"])
      area
 Min.   : 1016
 1st Qu.: 5305
 Median : 7487
 Mean   : 7188
 3rd Qu.: 8870
 Max.   :12212
> summary(rock[["area"]])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1016    5305    7487    7188    8870   12210

differ for exactly the reason you say (dispatching to different methods of summary), and the different values of max are both correct given the documentation. However, let's walk through what it takes to show that.

In the help page for summary, an option digits is described, which has the default value max(3, getOption("digits")-3). Executing this (or getOption("digits") alone and doing the math) results in the default value of digits being 4 (at least for me; and I do not believe that I have changed the option).
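A minimal sketch of that arithmetic, assuming the digits option is still at its factory default of 7 (the name default_digits is only for illustration):

```r
# Assumption: pin the option to R's factory default (7) so the sketch is
# reproducible, then restore whatever the session had before.
old <- options(digits = 7)

# The default expression quoted from the summary help page.
default_digits <- max(3, getOption("digits") - 3)
print(default_digits)

options(old)
```

If getOption("digits") has been changed in a session, the computed default moves with it, which is one more thing a reader would have to track.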

So what is this option used for? In the documentation, it says: "integer, used for number formatting with signif() (for summary.default) or format() (for summary.data.frame)." Let's assume that we realize that rock["area"] is a data frame, which would be handled by summary.data.frame, and rock[["area"]] is a vector, and further determine that summary.default is what will handle it (having not found summary.vector or summary.integer).
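The class check behind that assumption can be sketched as follows. The variable names are illustrative, and the getS3method lookup is just one way to look for a more specific method:

```r
# rock ships with the datasets package, which is attached by default.
df_col  <- rock["area"]     # single bracket: still a data frame
vec_col <- rock[["area"]]   # double bracket: the bare integer vector

class(df_col)
class(vec_col)

# Look for an integer-specific summary method. With optional = TRUE the
# lookup returns NULL instead of raising an error when none is registered.
int_method <- getS3method("summary", "integer", optional = TRUE)
is.null(int_method)
```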

Let's dive into the help page for signif and format, since they are listed as relevant to the use of digits in the two different cases.

signif tells us that digits is "integer indicating the number of ... significant digits (signif) to be used." Looking at "Details", the last sentence says "Each element of the vector is rounded individually, unlike printing." So in the case of a vector, each value is separately rounded to 4 significant digits (the max of 12212 is rounded to 12210).
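The per-element rounding can be seen by calling signif directly on two of the values from the output above (a small sketch; the vector five_num is made up for illustration and holds only the min and max):

```r
# Two of the summary values for rock$area, copied from the output above.
five_num <- c(min = 1016, max = 12212)

# signif rounds each element on its own to 4 significant digits.
rounded <- signif(five_num, digits = 4)
print(rounded)
```

The four-digit minimum survives untouched, while the five-digit maximum loses its last digit.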

format tells us that digits is "how many significant digits are to be used for numeric and complex x. ... This is a suggestion: enough decimal places will be used so that the smallest (in magnitude) number has this many significant digits, and also to satisfy nsmall."
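By contrast, here is a sketch of format applied to the whole set at once. The values are copied from the higher-precision output further down, and the name stats is only for illustration:

```r
# The six summary statistics for rock$area, to three decimal places.
stats <- c(1016, 5305.25, 7487, 7187.729, 8869.5, 12212)

# format picks a single number of decimal places for the whole vector,
# so no element loses digits to the left of the decimal point.
formatted <- format(stats, digits = 4)
print(formatted)
```

Here digits = 4 only decides how many decimal places are shown (none), which is why the maximum keeps all five of its digits.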

So the difference is that if it is a vector, each part (min, quartiles, mean, and max) is rounded to 4 significant digits individually, while if it is a column of a data frame, the set is collectively rounded so that the smallest has 4 significant digits and the rest are carried out to the same decimal place.

Some points:

1) Both of these functions are in base, so I would expect the same behavior using the same (default) arguments. Yes, the key word is "expect." Hopefully I have demonstrated that I understand why they differ. I would not anticipate rounding, and when only one value has only one digit rounded, it is not really obvious that it happened (as compared to, say, summary(11111*rock$area), where the rounding is obvious if I know the data were not all rounded to the nearest 10,000). So this is not just a matter of realizing that different methods are being dispatched; it also means reading through three different help pages (at least three, assuming I started at the right place and realized which other two were the relevant ones) to see that the end results are presented differently, WHICH I WOULD NOT REALIZE THAT I EVEN NEED TO DO.

2) rock$area is an integer vector, so even if I realize that rounding would be done on floating point numbers, I would not expect (yes, again, "expect") that integers would need to be rounded to some lesser number of significant digits.

3) The documentation for summary is actually wrong about digits for the case of summary.data.frame. Consider:
> summary(rock["area"], digits=17)
      area
 Min.   : 1016.0000000000000
 1st Qu.: 5305.2500000000000
 Median : 7487.0000000000000
 Mean   : 7187.7291666700003
 3rd Qu.: 8869.5000000000000
 Max.   :12212.0000000000000

In particular, note the mean. It is wrong (mathematically incorrect AND not consistent with the documentation).
> dput(mean(rock["area"]))
structure(7187.72916666667, .Names = "area")

Why? Internally, summary.data.frame calls summary.default on rock[["area"]] with a hard-coded digits value of 12. It then takes this value and formats it with 17 digits of precision as requested. That's why there are the four zeros in the middle (the last digit being numerical imprecision due to binary representation of floating point values).
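A rough sketch of that two-step effect, using signif with digits = 12 as a stand-in for the inner rounding (an assumption for illustration, not a copy of the summary.data.frame source; later R releases may do the rounding elsewhere):

```r
# Full-precision mean of the vector.
m <- mean(rock[["area"]])

# Assumed stand-in for the hard-coded inner step: keep 12 significant digits.
inner <- signif(m, digits = 12)

# Formatting afterwards with 17 digits cannot bring back what was discarded.
format(m, digits = 17)
format(inner, digits = 17)

# The two values differ slightly.
inner == m
```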

4) summary.default does not necessarily honor the number of significant digits either:

> for(i in 1:9) print(summary(rock[["area"]], digits=i))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1000    5000    7000    7000    9000   10000
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1000    5300    7500    7200    8900   12000
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1020    5310    7490    7190    8870   12200
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1016    5305    7487    7188    8870   12210
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 1016.0  5305.2  7487.0  7187.7  8869.5 12212.0
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 1016.00  5305.25  7487.00  7187.73  8869.50 12212.00
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 1016.000  5305.250  7487.000  7187.729  8869.500 12212.000
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 1016.000  5305.250  7487.000  7187.729  8869.500 12212.000
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 1016.000  5305.250  7487.000  7187.729  8869.500 12212.000

Beyond 7, no additional significant digits are printed, despite the value of digits. This is the behavior of signif:
> signif(mean(rock[["area"]]), digits=9)
[1] 7187.729
but is not consistent with documentation (which says digits can be as large as 22).
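One possible reading is that the ceiling at 7 comes from how the value is printed rather than from signif itself, since print falls back on getOption("digits"), which is 7 by default. A small sketch to check that (m9 and shown are illustrative names):

```r
# Round the mean to 9 significant digits.
m9 <- signif(mean(rock[["area"]]), digits = 9)

# Default printing uses getOption("digits"), so the display may be shorter
# than the precision actually stored in m9.
print(m9)

# Asking for more digits at display time reveals what signif kept.
shown <- format(m9, digits = 12)
print(shown)
```

If this reading is right, the rounding requested from signif is honored and only the display is truncated.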

but at least the documentation could more clearly warn users that this
method behaves differently in these cases -- summary(rock[,1]) vs
summary(rock[,1:2]) -- and that the method can and *does* return incorrect
results without any warning messages.


What is (in)adequate in documentation is often in the mind of the beholder.

Note:
class(rock[,1])
[1] "integer"

class(rock[,1:2])
[1] "data.frame"

This means that different methods are dispatched, leading to the different
results. Moreover,
summary(rock[,1,drop=FALSE])
       area
  Min.   : 1016
  1st Qu.: 5305
  Median : 7487
  Mean   : 7188
  3rd Qu.: 8870
  Max.   :12212

... and that is because
class(rock[,1,drop=FALSE])
[1] "data.frame"

So the relevant Help file is ?"[.data.frame"

That certainly explains the reason for the different dispatches, but it is only the start of understanding what is going on. The data frame method does more or less what you would expect (since format tends to be less surprising from an output point of view). Consider another example:

> summary(11111*rock["area"])
      area
 Min.   : 11288776
 1st Qu.: 58946633
 Median : 83188057
 Mean   : 79862859
 3rd Qu.: 98549015
 Max.   :135687532
> summary(11111*rock[["area"]])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 11290000  58950000  83190000  79860000  98550000 135700000

Both of these have a digits value of 4 (the default), but the data frame one "ignores" it (or, more accurately, format takes it as a recommendation but prints all values down to the 1's place despite only 4 significant digits being requested, probably due to nsmall being 0). The default method dutifully rounds each value to the requested default of 4 significant digits.
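The same contrast can be sketched on the scaled maximum alone, calling signif and format directly instead of going through summary (big_max is an illustrative name):

```r
# Scaled maximum from the example above: 11111 * 12212.
big_max <- 11111 * max(rock[["area"]])

# signif discards everything past the fourth significant digit.
signif(big_max, digits = 4)

# format treats digits as a suggestion and keeps the whole integer part.
format(big_max, digits = 4)
```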

I would encourage anyone teaching introductory R to look at the 'epicalc'
package.  The re-vamped function 'summ' in that package returns correct
results regardless - summ(rock), summ(rock$area).  In addition, when you
only ask for one column you not only get the correct results, you also get a
bonus distribution plot.

I would like all of our students to use R, but little things like this
are huge stumbling blocks for them.


I have no doubt that this is true. R is powerful, flexible and, as an
inevitable result, complex. To master it, honest effort is required,
probably a somewhat scarce commodity in introductory classes, especially for
non-statisticians. For that reason, there are numerous learning resources
available, to be found on CRAN. Have you looked at them? Moreover, there are
several R GUIs that attempt to shield the beginner from the initial shock,
to be found in the R-GUIs link under "Other Projects." Have you considered
those?

So I think something more than righteous indignation is called for here.
Nevertheless, the bottom line is that you get what you pay for: R **IS**
hard -- but for many serious data analysts of all stripes, worth the effort.

I saw it more as exasperation at inconsistencies than as righteous indignation. There is much power in R, and there are many subtle points (to which the existence of the R Inferno attests). Certainly, the more complicated the task undertaken, the more subtleties are to be expected. But having to track subtle rounding issues for a simple summary of a set of numbers (depending on how exactly the summary is requested) is where I thought the frustration was coming from.

Cheers,
Bert

-jeanne


--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
