Re: [R] significance test interquartile ranges

peter dalgaard Sun, 15 Jul 2012 03:26:27 -0700

On Jul 14, 2012, at 19:58 , Schaber, Jörg wrote:

> Dear Peter,
> 
> thanks for your clarifications. Sample size is around 200 in each group. 
> Would that justify your approach?


It's certainly better than 10... 

I did a small check on the IgM data from the ISwR package (298 obs.) and found 
something somewhat amusing: Discretization effects can kick in rather 
profoundly with data sets of that magnitude. 

The IgM data are discretized to 1 decimal digit, which is fairly common for 
"continuous" data in practice:

> library(ISwR)
> table(IgM)
IgM
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9   1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8   2 2.1 
  3   7  19  27  32  35  38  38  22  16  16   6   7   9   6   2   3   3   3   2 
2.2 2.5 2.7 4.5 
  1   1   1   1 
> summary(IgM)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.100   0.500   0.700   0.803   1.000   4.500 
> IQR(IgM)
[1] 0.5

However, if we want to look at the sampling distribution of a quantile, we get 
some curious effects because the variation of the estimate is close to the 
discretization error. Try a simple bootstrap sample from the empirical CDF:

> medians <- replicate(10000,median(sample(IgM,replace=T)))
> table(medians)
medians
 0.6 0.65  0.7 0.75  0.8 
  13    6 9035  179  767 

However, if we smooth the empirical CDF by adding a little noise, we do get 
something that does look passably (although not perfectly) Gaussian:

> x <- IgM + runif(IgM, -.05,.05)
> medians2 <- replicate(10000,median(sample(x,replace=T)))
> hist(medians2)
> qqnorm(medians2)

Interestingly, adding noise has the counterintuitive effect of reducing the 
standard error of the medians:

> sd(medians)
[1] 0.02748966
> sd(medians2)
[1] 0.02347363

(It's not _that_ counterintuitive given that the definition of the median isn't 
quite the same for discrete data.)
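
A tiny illustration of that point, using made-up values rather than the IgM 
data: with an even sample size (IgM has 298 observations), median() averages 
the two middle order statistics, so data on a 0.1 grid can give a median 
halfway between grid points, such as the 0.65 and 0.75 in the table above.

```r
# Made-up values on a 0.1 grid (not the IgM data)
v_even <- c(0.5, 0.6, 0.7, 0.8)
v_odd  <- c(0.5, 0.6, 0.7, 0.8, 0.9)

m_even <- median(v_even)  # average of the two middle values, 0.6 and 0.7
m_odd  <- median(v_odd)   # the single middle value
```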

Back to the IQR. You can do much the same thing:

> iqrs <- replicate(10000,IQR(sample(IgM,replace=T)))
> table(iqrs)
iqrs
  0.3 0.375   0.4  0.45 0.475   0.5  0.55 0.575   0.6 
   60    42  3885     7   640  5100     3    87   176 

or use the smoothed version, replacing IgM by x (defined above).

Now, what if we wanted to compare two IQRs? I'll cheat and reuse the same ECDF 
for both groups.

> i1 <- replicate(10000,IQR(sample(IgM,replace=T)))
> i2 <- replicate(10000,IQR(sample(IgM,replace=T)))
> qqnorm((i1-i2)/sd(i1-i2))
> mean(abs(i1-i2)/sd(i1-i2) < 2)
[1] 0.9698

So, not really all that bad, but it is a bit fortuitous given the discreteness 
of the distribution.

The same thing with x comes out quite a bit nicer:

> ix1 <- replicate(10000,IQR(sample(x,replace=T)))
> ix2 <- replicate(10000,IQR(sample(x,replace=T)))
> qqnorm((ix1-ix2)/sd(ix1-ix2))
> mean(abs(ix1-ix2)/sd(ix1-ix2) < 2)
[1] 0.9546

So, my conclusion would be that yes, you can use bootstrap techniques with data 
of that size, but you need to watch out for discretization effects by checking 
the bootstrap sample distributions, and you might want to add a little 
smoothing noise for stability. 
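
One rough way to run that check is to count how many distinct values the 
bootstrap replicates take, with and without the added noise. The sketch below 
is only an illustration: the data are simulated (not IgM), boot_stat is a 
hypothetical helper name, and the noise half-width of 0.05 assumes the data 
were recorded to one decimal.

```r
set.seed(42)
# Simulated stand-in for data recorded to one decimal (not the IgM data)
y <- round(rlnorm(300, meanlog = -0.3, sdlog = 0.5), 1)

# Hypothetical helper: B bootstrap replicates of a statistic
boot_stat <- function(data, stat, B = 2000) {
  replicate(B, stat(sample(data, replace = TRUE)))
}

# Uniform noise with half-width equal to half the recording step (0.1 / 2)
y_smooth <- y + runif(length(y), -0.05, 0.05)

b_raw    <- boot_stat(y, IQR)
b_smooth <- boot_stat(y_smooth, IQR)

# Only a handful of distinct values suggests discretization dominates
n_raw    <- length(unique(round(b_raw, 6)))
n_smooth <- length(unique(round(b_smooth, 6)))
```

Comparing n_raw with n_smooth, or looking at table(round(b_raw, 3)) and 
hist(b_smooth), shows whether the replicates pile up on a few grid values.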

As always with bootstrapping, beware that the simulation is never done under 
the null hypothesis; one merely hopes that the distribution of the resampled 
estimates around the observed estimate is sufficiently similar to that of the 
estimator around the true value that it can be used for tests and confidence 
intervals, implicitly using a location-shift argument. This gets particularly 
dubious when there are discretization effects because the jumps occur at values 
that do not depend on the parameters. 

(Pragmatically speaking, you might not be interested at all in differences in 
IQR which are comparable to discretization error, though.) 
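
For two genuinely different groups, the comparison could be sketched as below. 
This is an assumption-laden illustration, not a recommended procedure: g1 and 
g2 are simulated continuous samples of 200 each (so no smoothing noise is 
added), each group is resampled separately, and the "basic" interval is one 
common way of turning the location-shift argument into a confidence interval.

```r
set.seed(1)
# Simulated stand-ins for the two groups (about 200 each, as in the question)
g1 <- rgamma(200, shape = 4, scale = 0.20)
g2 <- rgamma(200, shape = 4, scale = 0.25)

obs_diff <- IQR(g1) - IQR(g2)

# Resample each group separately and recompute the difference in IQR
boot_diff <- replicate(5000,
  IQR(sample(g1, replace = TRUE)) - IQR(sample(g2, replace = TRUE)))

# Normal approximation: observed difference relative to its bootstrap SE
se_diff <- sd(boot_diff)
z <- obs_diff / se_diff          # compare with +/- 2

# Basic bootstrap interval: resampled quantiles reflected around obs_diff
q <- quantile(boot_diff, c(0.025, 0.975), names = FALSE)
ci_basic <- c(2 * obs_diff - q[2], 2 * obs_diff - q[1])
```

As with the earlier plots, qqnorm(boot_diff) is worth a look before trusting 
either z or ci_basic.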


> 
> I found a couple more tests for scale on continuous variables, i.e. 
> Mood Test
> Ansari-Bradley Test (that one is also implemented in R)
> Klotz Test
> Conover Test
> 
> Would one of those be suitable to test for different dispersion (e.g. IQR or 
> the like) in non-normal distributions?
> 

That is what they were designed to do... I'm not all that well acquainted with 
them, but given what I have seen from that general area and period, they should 
likely be studied with a critical eye to hidden assumptions. Quite a lot of 
work has been published with the general structure of "let's do some sensible 
transformations of data and apply a nonparametric test, then call the whole 
procedure assumption-free" (in those days, 1950s and 1960s, essentially, 
computer simulations were not readily available to show people the error of 
their ways...).
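
For reference, two of the listed tests are available in base R's stats package 
as ansari.test() and mood.test(). The sketch below only shows the calls on 
simulated heavy-tailed data with a common center; the samples a and b are 
made up. One assumption worth keeping in mind is that the Ansari-Bradley test 
is a test for scale when the two samples share the same location.

```r
set.seed(2)
# Simulated heavy-tailed samples with the same center but different scale
a <- rt(200, df = 3)
b <- 1.5 * rt(200, df = 3)

ab <- ansari.test(a, b)   # Ansari-Bradley two-sample test for scale
md <- mood.test(a, b)     # Mood two-sample test for scale

p_ab <- ab$p.value
p_md <- md$p.value
```

Centering each sample by its own median before testing is sometimes suggested 
when locations differ, but that is exactly the kind of preprocessing whose 
effect on the test's behaviour would need checking by simulation.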

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd....@cbs.dk  Priv: pda...@gmail.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
