Dear Peter and Tim,

Thank you very much for taking the time to explain this to me! It is much
clearer now.
And sorry if this is perhaps not the right use of the list; I really hope
it is OK and gets posted, because I think it is important that
non-statisticians like myself get a good idea of the concepts behind R's
functions. I am really grateful you went through this with me.

-f

2011/10/9 Tim Hesterberg <timhesterb...@gmail.com>

> I'll concur with Peter Dalgaard that
> * a permutation test is the right thing to do - your problem is equivalent
>  to a two-sample test,
> * don't bootstrap, and
> * don't bother with t-statistics
> but I'll elaborate a bit on why, including
> * two approaches to the whole problem - and how your approach relates
>  to the usual approach,
> * an interesting tidbit about resampling t statistics.
>
> First, I'm assuming that your x variable is irrelevant, only y matters,
> and that sample1 is a proper subset of df.  I'll also assume that you
> want to look for differences in the mean, rather than arbitrary differences
> (in which case you might use e.g. a Kolmogorov-Smirnov test).
>
> There are two general approaches to this problem:
> (1) two-sample problem, sample1$y vs df$y[rows other than sample 1]
> (2) the approach you outlined, thinking of sample1$y as part of df$y.
>
> Consider (1), and call the two data sets y1 and y2.
> The basic permutation test approach is:
> * compute the test statistic theta(y1, y2), e.g. mean(y1) - mean(y2)
> * repeat 9999 (or 99999) times:
>   draw a sample of size n1 from the pooled data, call that Y1, and
>   call the rest Y2; compute theta(Y1, Y2)
> * the P-value for a one-sided test is (1 + k) / (1 + 9999),
>   where k is the number of permutation samples with
>   theta(Y1, Y2) >= theta(y1, y2)
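>
> Sketched in R, that recipe might look like this (a minimal, untested
> illustration; y1 and y2 stand for the two groups, e.g. sample1$y and
> the y values of the remaining rows):
>
> perm.test <- function(y1, y2, R = 9999) {
>   pooled <- c(y1, y2)
>   n1 <- length(y1)
>   obs <- mean(y1) - mean(y2)            # observed statistic theta(y1, y2)
>   theta <- replicate(R, {
>     idx <- sample(length(pooled), n1)   # draw Y1 without replacement
>     mean(pooled[idx]) - mean(pooled[-idx])
>   })
>   (1 + sum(theta >= obs)) / (1 + R)     # one-sided P-value
> }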
>
> The test statistic could be
>  mean(y1) - mean(y2)
>  mean(y1)
>  sum(y1)
>  t-statistic (pooled variance)
>  P-value for a t-test (pooled variance)
>  mean(y1) - mean(pooled data)
>  t-statistic (unpooled variance)
>  P-value for a t-test (unpooled variance)
>  median(y1) - median(y2)
>  ...
> The first six of those are equivalent - they give exactly the same P-value
> for the permutation test.  The reason is that those test statistics
> are monotone transformations of each other, given the data.
> Hence, doing the pooled-variance t calculations gains nothing.
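>
> A quick way to convince yourself of that equivalence (again an untested
> sketch, here with made-up data) is to run the same permutations with two
> of those statistics and compare the P-values:
>
> set.seed(1)
> y1 <- rnorm(10); y2 <- rnorm(15)
> pooled <- c(y1, y2); n1 <- length(y1)
> stats <- replicate(9999, {
>   idx <- sample(length(pooled), n1)
>   c(mean(pooled[idx]) - mean(pooled[-idx]),  # mean(Y1) - mean(Y2)
>     mean(pooled[idx]))                       # mean(Y1)
> })
> c((1 + sum(stats[1, ] >= mean(y1) - mean(y2))) / (1 + 9999),
>   (1 + sum(stats[2, ] >= mean(y1))) / (1 + 9999))  # the two agree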
>
> Now consider your approach (2).  That is equivalent to the permutation
> test using the test statistic:  mean(y1) - mean(pooled data).
>
> Why not a bootstrap?  E.g. pool the data and draw samples of size
> n1 and n2 from the pooled data, independently and with replacement.
> That is similar to the permutation test, but less accurate.  Probably
> the easiest way to see this is to suppose there is 1 outlier in the
> pooled data.  In any permutation iteration there is exactly 1 outlier
> among the two samples.  With bootstrapping, there could be 0, 1, 2, ....
> The permutation test answers the question: given that there is exactly
> 1 outlier in my combined data, what is the probability that random chance
> would give a difference as large as I observed?  The bootstrap would
> answer some other question.
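>
> You can check the outlier point directly (an untested sketch with
> artificial data: one planted outlier in a pooled sample of 50):
>
> set.seed(1)
> pooled <- c(rnorm(49), 100)   # 100 is the single outlier
> n1 <- 10; n2 <- 40
> counts <- replicate(10000, {
>   Y1 <- sample(pooled, n1, replace = TRUE)   # bootstrap resamples
>   Y2 <- sample(pooled, n2, replace = TRUE)
>   sum(c(Y1, Y2) == 100)
> })
> table(counts)   # the outlier appears 0, 1, 2, ... times; a permutation
>                 # split would contain it exactly once, every time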
>
> Tim Hesterberg
> NEW!  Mathematical Statistics with Resampling and R, Chihara & Hesterberg
>
> http://www.amazon.com/Mathematical-Statistics-Resampling-Laura-Chihara/dp/1118029852/ref=sr_1_1?ie=UTF8
> http://home.comcast.net/~timhesterberg
>  (resampling, water bottle rockets, computers to Guatemala,
>  shower = 2650 light bulbs, ...)
>
>
> >On Oct 8, 2011, at 16:04 , francy wrote:
> >
> >> Hi,
> >>
> >> I am having trouble understanding how to approach a simulation:
> >>
> >> I have a sample of n=250 from a population of N=2,000 individuals, and I
> >> would like to use either a permutation test or the bootstrap to test
> >> whether this particular sample is significantly different from the values
> >> of any other random samples of the same population. I thought I needed to
> >> take random samples (but I am not sure how many simulations I need to do)
> >> of n=250 from the N=2,000 population and maybe do a one-sample t-test to
> >> compare the mean score of all the simulated samples, plus the one sample
> >> I am trying to prove is different from any others, to the mean value of
> >> the population. But I don't know:
> >> (1) whether this one-sample t-test would be the right way to do it, and
> >> how to go about doing this in R
> >> (2) whether a permutation test or bootstrap methods are more appropriate
> >>
> >> This is the data frame that I have, which is to be sampled:
> >> df<-
> >> i.e.
> >> x y
> >> 1 2
> >> 3 4
> >> 5 6
> >> 7 8
> >> . .
> >> . .
> >> . .
> >> (2,000 rows)
> >>
> >> I have this sample from df, and would like to test whether it has
> >> extreme values of y.
> >> sample1<-
> >> i.e.
> >> x y
> >> 3 4
> >> 7 8
> >> . .
> >> . .
> >> . .
> >> (250 rows)
> >>
> >> For now I only have this:
> >>
> >> R <- 999  # number of simulations, but I don't know how many...
> >> t.values <- numeric(R)  # numeric vector with 999 elements, which
> >> # will hold the result of each simulation
> >> for (i in 1:R) {
> >>   sample1 <- df[sample(nrow(df), 250, replace=TRUE), ]
> >>
> >> But I don't know how to continue the loop: do I calculate the mean for
> >> each simulation and compare it to the population mean?
> >> Any help you could give me would be very appreciated,
> >> Thank you.
> >
> >The straightforward way would be a permutation test, something like this
> >
> >msamp <- mean(sample1$y)
> >mpop <- mean(df$y)
> >msim <- replicate(10000, mean(sample(df$y, 250)))
> >
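> ># two-sided P-value: proportion of simulated sample means at least as
> ># far from the population mean as sample1's mean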
> >sum(abs(msim-mpop) >= abs(msamp-mpop))/10000
> >
> >I don't really see a reason to do bootstrapping here. You say you want to
> >test whether your sample could be a random sample from the population, so
> >just simulate that sampling (which should be without replacement, like your
> >sample is). Bootstrapping might come in if you want a confidence interval
> >for the mean difference between your sample and the rest.
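> >
> >Such an interval might be sketched like this (untested; rest is a
> >stand-in for the y values of the 1,750 rows not in sample1, which you
> >would need to identify yourself):
> >
> >d <- replicate(10000,
> >               mean(sample(sample1$y, replace = TRUE)) -
> >               mean(sample(rest, replace = TRUE)))
> >quantile(d, c(0.025, 0.975))  # percentile bootstrap interval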
> >
> >Instead of sampling means, you could put a full-blown t-test inside the
> >replicate expression, like:
> >
> >psim <- replicate(10000, {s <- sample(1:2000, 250);
> >                          t.test(df$y[s], df$y[-s])$p.value})
> >
> >and then check whether the p value for your sample is small compared to
> >the distribution of values in psim.
> >
> >That'll take quite a bit longer, though; t.test() is a more complex beast
> >than mean(). It is not obvious that it has any benefits either, unless you
> >specifically wanted to investigate the behavior of the t test.
> >
> >(All code untested. Caveat emptor.)
> >
> >
> >--
> >Peter Dalgaard, Professor,
> >Center for Statistics, Copenhagen Business School
> >Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> >Phone: (+45)38153501
> >Email: pd....@cbs.dk  Priv: pda...@gmail.com
>
>
