Hi Dr. Lumley, you're obviously correct about all of that. Thank you for cluing me into it! And sorry for overlooking that part of the documentation.
I'm unfortunately still struggling with matching numbers exactly, and I foolishly provided a dataset without a weight variable - thinking there was only one issue preventing R from matching Stata precisely. Adding the df= to the confint() call seems to have only solved half of the problem. If you (or anyone else) still has any energy to look at this issue, I've pasted three analysis examples with both Stata and R code that aim to calculate a survey-adjusted confidence interval. All three examples match coefficient and standard error values precisely, for both weighted counts and percents. In the first example, the confidence intervals don't match for either the counts or the percents. In the second, the CIs match for counts but not percents. In the third, CIs match for both. Because of this erratic behavior on relatively straightforward datasets, I'm worried that Stata is making some weird (non-reproducible) calculations that would obviously be outside of the scope of this list. Thanks!! # # # # # # # # # example 1 # # # # # # # # # # simple example of confidence intervals not matching between stata and R # this example doesn't match CIs for counts or percents options( digits = 10 ) library(foreign) library(survey) x <- read.dta( " http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop.dta" ) x$pw <- 6194/310 y <- svydesign( ~1 , data = x , weights = ~pw ) svytotal( ~factor( stype ) , y ) confint( svytotal( ~factor( stype ) , y ) , df = degf( y ) ) svymean( ~factor( stype ) , y ) confint( svymean( ~factor( stype ) , y ) , df = degf( y ) ) use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear gen pw = 6194/310 svyset [pweight=pw] svy: tabulate stype, count se ci format (%12.3f) svy: tabulate stype, percent se ci format (%12.3f) ---------------------------------------------------------- stype | count se lb ub ----------+----------------------------------------------- E | 88334.428 710.843 86940.929 89727.927 H | 15085.386 514.508 14076.773 16094.000 M | 20340.296 582.814 19197.778 21482.814 -------------------------------------------------------------- stype | percentages se lb ub ----------+--------------------------------------------------- E | 71.376 0.574 70.236 72.488 H | 12.189 0.416 11.397 13.028 M | 16.435 0.471 15.533 17.379 # # # # # # # # # example 2 # # # # # # # # # # simple example of confidence intervals not matching between stata and R # this example does match CIs for counts but does not match for percents options(digits=10) library(foreign) library(survey) # create a sample data frame with 100 records, containing values 1 - 5 and weights 1 or 2 x <- data.frame( case = 1:100 , val = 1:5 , weight = 1:2 ) # immediately save it externally to be read into stata for comparison write.dta( x , "c:/my directory/hundred record example.dta" ) # create a simple survey design, with weights only y <- svydesign( ~1 , data = x , weights = ~weight ) # the confidence interval for the weighted counts do match stata svytotal( ~I( val == 1 ) , y ) confint( svytotal( ~I( val == 1 ) , y ) , df = degf( y ) ) # generate the mean & standard error -- these two match precisely.. svymean( ~I( val == 1 ) , y ) # # # # # # # # # # # # ..but none of the confidence interval attempts match confint( svymean( ~I( val == 1 ) , y ) ) confint( svymean( ~I( val == 1 ) , y ) , df = degf( y ) ) svyciprop(~I( val == 1 ) , y, method="li" , df = degf( y ) ) svyciprop(~I( val == 1 ) , y, method="lo", df = degf( y ) ) svyciprop(~I( val == 1 ) , y, method="as", df = degf( y ) ) svyciprop(~I( val == 1 ) , y, method="be", df = degf( y ) ) # # # # # # # # # # # ************************************** * run the R code first to create this dta file use c:\my directory\hundred record example.dta , clear svyset [pweight=weight] svy: tabulate val, count se ci svy: tabulate val, percent se ci Number of strata = 1 Number of obs = 100 Number of PSUs = 100 Population size = 150 Design df = 99 ---------------------------------------------------------- val | count se lb ub ----------+----------------------------------------------- 1 | 30 6.435 17.23 42.77 2 | 30 6.435 17.23 42.77 3 | 30 6.435 17.23 42.77 4 | 30 6.435 17.23 42.77 5 | 30 6.435 17.23 42.77 Number of strata = 1 Number of obs = 100 Number of PSUs = 100 Population size = 150 Design df = 99 -------------------------------------------------------------- val | percentages se lb ub ----------+--------------------------------------------------- 1 | 20 4.238 12.88 29.72 2 | 20 4.238 12.88 29.72 3 | 20 4.238 12.88 29.72 4 | 20 4.238 12.88 29.72 5 | 20 4.238 12.88 29.72 # # # # # # # # # example 3 # # # # # # # # # # this example matches CIs for both library(foreign) library(survey) x <- read.dta( "http://www.stata-press.com/data/r11/nhanes2f.dta" ) y <- svydesign( ~1 , data = x , weights = ~finalwgt ) svymean( ~sex , y ) confint( svymean( ~sex , y ) , df = degf( y ) ) mean SE sexMale 0.47958 0.0059 sexFemale 0.52042 0.0059 2.5 % 97.5 % sexMale 0.4681058 0.4910512 sexFemale 0.5089488 0.5318942 ************************************** use http://www.stata-press.com/data/r11/nhanes2f, clear svyset [pweight=finalwgt] svy: tabulate sex, percent se ci (running tabulate on estimation sample) Number of strata = 1 Number of obs = 10337 Number of PSUs = 10337 Population size = 117023659 Design df = 10336 -------------------------------------------------------------- 1=male, | 2=female | percentages se lb ub ----------+--------------------------------------------------- Male | 47.96 .5853 46.81 49.11 Female | 52.04 .5853 50.89 53.19 On Sun, Sep 23, 2012 at 4:30 PM, Thomas Lumley <tlum...@uw.edu> wrote: > On Sat, Sep 22, 2012 at 2:51 AM, Anthony Damico <ajdam...@gmail.com> > wrote: > > > Survey: Mean estimation > > > > Number of strata = 1 Number of obs = 183 > > Number of PSUs = 15 Population size = 9235.4 > > Design df = 14 > > > > -------------------------------------------------------------- > > | Linearized > > | Mean Std. Err. [95% Conf. Interval] > > -------------+------------------------------------------------ > > ell0 | .0218579 .0226225 -.0266624 .0703783 > > -------------------------------------------------------------- > > This matches > > > svyciprop(~I(ell==0),dclus1,df=14,method="mean") > 2.5% 97.5% > I(ell == 0) 0.0218579 -0.0266624 0.07038 > > as does this > > > confint(svymean(~I(ell==0),dclus1),df=14) > 2.5 % 97.5 % > I(ell == 0)FALSE 0.92962174505 1.02666240796 > I(ell == 0)TRUE -0.02666240796 0.07037825495 > > The df= argument is not explicitly documented in ?svyciprop, but it is > in ?confint.svystat and ?svymean > > [I was slowed down a bit by the claim that the Stata intervals were > asymmetric, but in fact they aren't] > > -thomas > > -- > Thomas Lumley > Professor of Biostatistics > University of Auckland > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.