Dear List: I am working to understand some differences between the results of the svymean() function in the survey package and from code I have written myself. The results from svymean() also agree with results I get from SAS proc surveymeans, so, this suggests I am misunderstanding something.
I am never comfortable with "I did what the software" does mentality, so I am working to fully replicate the methods to make sure I know what is going on and then I can feel better about software. Below I provide three things for this conversation should anyone engage. First I provide an example of R code where my code replicates svymean() exactly. Second, I provide an example where the same code is used but the results of my code and svymean() do not agree. Last, at the very bottom of this is a minimal LaTeX code that you can compile to see my derivation of the variance estimator using the taylor series expansion. My code reflects this derivation. ### Below is an example where my code and svymean agree. ### My code implements the taylor series reflected in the LaTeX below ### These data are balanced in that there are the same number of individuals per cluster ### Create correlated data set.seed(100) x <- rnorm(10) y <- rep(10,10) z <- rep(x,y) score <- rnorm(100) + z mm <- data.frame(group = gl(10,10), score) ### Use svymean to get results gg <- svydesign(ids = ~group, data = mm) svymean(~score, gg, deff='replace', na.rm=TRUE) ### My code which replicates the formulae in the LaTeX code below ## Cluster totals xx <- with(mm, tapply(score, group, sum, na.rm=TRUE)) ## Mean of the total ss <- mean(xx, na.rm=T) ## get total variance k <- length(xx) totvar <- sum((xx-ss)^2) * k/(k-1) ## The SE is then N <- nrow(mm) sqrt(totvar/N^2) One additional thing to point out is this: ## sqrt of variance of the total is: > svytotal(~score, gg) total SE score -1.7923 21.139 ## sqrt of variance of total from my code > sqrt(totvar) [1] 21.13849 ### Now, the code below presents only the output as this is from ### a data file I am working with. ### I can send data to anyone interested ### It is the same code as above, just applied to a data > gg <- svydesign(ids = ~CSRSSchoolCode, data = qq) Warning message: In svydesign.default(ids = ~CSRSSchoolCode, data = qq) : No weights or probabilities supplied, assuming equal probability > svymean(~OSPI25632, gg, deff='replace', na.rm=TRUE) mean SE DEff OSPI25632 0.84250 0.01072 1.0384 > > ### My code which replicates the formulae in the LaTeX code below > ## Cluster totals > xx <- with(qq, tapply(OSPI25632, CSRSSchoolCode, sum, na.rm=TRUE)) > ## Mean of the total > ss <- mean(xx, na.rm=T) > ## get total variance > k <- length(xx) > totvar <- sum((xx-ss)^2) * k/(k-1) > ## The SE is then > N <- nrow(qq) > sqrt(totvar/N^2) [1] 0.02158459 Now, we can see the SE's are different. But, look at the following: > svytotal(~OSPI25632, gg) total SE OSPI25632 1011 25.901 > sqrt(totvar) [1] 25.90151 The sqrt of the variance of the total is the same as what my code produces. But, because we get a different SE the only difference I think is in the denominator, which is the N^2. Now, in my data the N size is: > nrow(qq) [1] 1200 But, we can do a little algebra and see what N size is being used by svymean to get the SE as # the numerator is the sqrt of the variance of the total from svytotal and the denominator is # the se returned by svymean 25.901/.01072 [1] 2416.138 So, if we take 2416.138 as the N size I can replicate the svymean standard error. > 25.901/2416.138 [1] 0.01072 But, this is not the N size. The total N size is 1200 and the cluster size is 519 in the observed data. So, as far as I can tell, it seems svymean() is using a different N size that what I think the formula from the taylor series expansion would dictate. Now, this is what I cannot seem to reconcile. I am certain there is either another difference somewhere or a good reason why this is occuring, I just can't seem to get my head aorund the reason and would appreciate anyone with insight into this problem. SessionInfo and LaTex are below. Many thanks, Harold Below is the minimal LaTeX code for additional transparency. \documentclass[12pt]{article} \usepackage{bm,geometry} \begin{document} First define the ratio estimator of the mean as: \begin{equation} f(Y) = \frac{Y}{N} \end{equation} \noindent where $Y$ is the total of the variable and $N$ is the population size. A first-order taylor series expansion of the ratio estimator $f(Y)$ can be used to derive the design-consistent variance estimator as: \begin{equation} var(f(Y)) = \left[\frac{\partial f(Y)}{Y}\right]^2 var(Y) \end{equation} \noindent where \begin{equation} \left[\frac{\partial f(Y)}{Y}\right] = \frac{1}{N} \end{equation} \begin{equation} var(Y) = \frac{k}{k-1} \sum_{j=1}^k(\hat{Y}_j-\hat{Y}_{..})^2 \end{equation} \begin{equation} \hat{Y}_j = \sum_{i=1}^{n_j}\hat{Y}_{j(i)} \end{equation} \begin{equation} \hat{Y}_{..} = k^{-1} \sum_{j=1}^k \hat{Y}_j \end{equation} \noindent where $j$ indexes cluster $(1, 2, \ldots, k)$, $j(i)$ indexes the $i$th member of cluster $j$, and $n_j$ is the total number of members in cluster $j$. The standard error is then taken as: \begin{equation} se = \sqrt{\frac{var(f(Y))}{N^2}} \end{equation} \end{document} > sessionInfo() R version 2.7.1 (2008-06-23) i386-pc-mingw32 locale: LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] survey_3.8 loaded via a namespace (and not attached): [1] tools_2.7.1 ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.