On 8/15/08, Doran, Harold <[EMAIL PROTECTED]> wrote:
> 1) In this linearization, I do treat N (population) size as a known
> constant. I thought that is what svymean() and SAS proc surveymeans did as
> well. So, this is a simple univariate expansion since I only take the
> derivative w.r.t to Y, the population total.

The sample size is the issue, and the sample size is the random variable. In many surveys, the population size may not be known at all. All that you will have is an estimator of that population size, which is total[1] in my quasi-notation. Procedurally, the latter is usually the sum of weights, and each weight is usually the inverse probability of selection.

> 2) Yes, the cluster sizes do vary. I meant to mention this. But, I wasn't
> sure if this was an issue or not. You can see in my first example I add in
> the comment that the data are balanced. That is because I created a second
> example (but didn't include it in this email) where I created an unbalanced
> data set where the cluster sizes vary. But, my code and svymeans() gave the
> exact same output when I ran it on the unbalanced cases as well.

OK, I did not check the details, but that is strange. With unbalanced panels, off the top of my head, the linearization estimator looks like

    [STUFF] sum_j n_j (\bar Y_{\cdot j} - \bar Y_{\cdot \cdot})^2

where [STUFF] will do the proper scaling, something like 1/N^2. That's not your formula. It might coincide with what you've been using in some pretty special case (like constant within-cluster variance, which is probably what you assumed in your simulated data). Again, look up Korn and Graubard's book; they have a good discussion of this estimator.

> 3) There are no weights with these data. The data I am working with are
> test scores from a state. Students are clustered within schools. Entire
> schools were chosen to participate in the assessment.

So let me restate that: you have complete schools that were sampled randomly, right? That's a pretty rare form of design. That's actually just a one-stage cluster design, which I thought only existed in textbooks! You do have to take that into account, and that also addresses the next issue:

> 4) I was thinking the finite population correction would not be needed in
> this case, but maybe I am wrong.
> But if I did add in the finite population
> correction, that would affect the variance of the total and I would get a
> different estimate than what svymeans or SAS proc means gives and that
> doesn't occur. As it stands now, my code, and the built in functions return
> the same variance of the total.

Your finite population correction, at least at the second level (students within schools), is 100%: you don't have any variance at all at the school level (at least design variance, see comment below). So what's left is the SRS (or whatever your sampling scheme for schools was... I would use a probability proportional to size sampling design there) of schools. You need to compute the school averages, and treat them as i.i.d. data. You can do it as is, or you can take your original data and specify the design with 100% fpc.

The total[1] is still a random variable (= sample size * # schools in the population / # schools in the sample, all of which are available to you), and the variance estimator is still the first term of the Taylor series linearization. Korn & Graubard give a very telling exercise/problem where they show that with heavily unbalanced panels, even if you sample all of them, you can get results that are quite notably biased. The first-level fpc (1 - # schools in sample / # schools in population) is still due, as I am sure that's not a negligible number.

That's what the design paradigm prescribes you to do. You probably won't like the idea of zero variance, and that naturally is suspicious. What your intuition is telling you is that there is measurement error, etc. Then what goes on in your head is that you think about your results in terms of model, or superpopulation, inference, which in this case amounts to ANOVA. On model vs. design-based inference, read Binder and Roberts 2003 (http://www.citeulike.org/user/ctacmo/article/1036932).

Note that svymean() is by no means built-in, though.
SAS PROC SURVEYMEANS is; Stata's -svy: mean- is; but in R, most of the stuff is user-contributed :)). See if Tom Lumley has any comments about whether his package supports 100% fpc :))

-- 
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.
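
For readers who want to see the arithmetic, here is a minimal sketch of the design-based computation described in the reply: whole schools drawn by SRS, every student in a sampled school observed, the mean treated as a ratio of two estimated totals, and only the first-stage fpc applied. It is written in plain Python rather than R so it runs without any package; the function name, argument names, and toy data are hypothetical, and this is the textbook ratio linearization under those assumptions, not necessarily what svymean() or PROC SURVEYMEANS do internally.

```python
def cluster_mean_and_variance(clusters, n_pop_clusters):
    """Ratio-type mean and linearization variance for a one-stage cluster sample.

    clusters       : list of lists, one list of scores per sampled school
                     (assumes every student in a sampled school is observed)
    n_pop_clusters : number of schools in the population (assumed known)
    """
    n = len(clusters)
    totals = [sum(c) for c in clusters]   # school totals of the scores
    sizes = [len(c) for c in clusters]    # school sizes (random across samples)
    m_total = sum(sizes)

    # Ratio estimator of the mean: estimated total / estimated population size.
    # The SRS weights N/n cancel between numerator and denominator.
    ybar = sum(totals) / m_total

    # Linearized residual per school; these sum to zero by construction.
    resid = [(t - ybar * m) / m_total for t, m in zip(totals, sizes)]

    # First-stage fpc only: the within-school sampling fraction is 100%,
    # so there is no second-stage design variance to add.
    fpc = 1 - n / n_pop_clusters
    variance = fpc * n / (n - 1) * sum(e * e for e in resid)
    return ybar, variance


if __name__ == "__main__":
    # Toy, unbalanced data: three sampled schools out of a hypothetical ten.
    scores = [[1, 2, 3], [4, 5], [6]]
    est, var = cluster_mean_and_variance(scores, n_pop_clusters=10)
    print(est, var ** 0.5)
```

Setting `n_pop_clusters` equal to the number of sampled schools drives the variance to zero, which is the "zero design variance" situation the reply warns will look suspicious from a model-based point of view.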