Re: [R] Exactly Replicating Stata's Survey Data Confidence Intervals in R

Anthony Damico Mon, 24 Sep 2012 10:58:45 -0700

Hi Dr. Lumley, you're obviously correct about all of that.  Thank you for
cluing me into it!  And sorry for overlooking that part of the
documentation.


I'm unfortunately still struggling with matching numbers exactly, and I
foolishly provided a dataset without a weight variable - thinking there was
only one issue preventing R from matching Stata precisely.  Adding the df=
to the confint() call seems to have only solved half of the problem.

If you (or anyone else) still has any energy to look at this issue, I've
pasted three analysis examples with both Stata and R code that aim to
calculate a survey-adjusted confidence interval.

All three examples match coefficient and standard error values precisely,
for both weighted counts and percents.  In the first example, the
confidence intervals don't match for either the counts or the percents.  In
the second, the CIs match for counts but not percents.  In the third, CIs
match for both.  Because of this erratic behavior on relatively
straightforward datasets, I'm worried that Stata is making some weird
(non-reproducible) calculations that would obviously be outside of the
scope of this list.

Thanks!!


# # # # # # # #
# example 1 #
# # # # # # # #

# simple example of confidence intervals not matching between stata and R
# this example doesn't match CIs for counts or percents

options( digits = 10 )
library(foreign)
library(survey)
x <- read.dta( "
http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop.dta"; )
x$pw <- 6194/310
y <- svydesign( ~1 , data = x , weights = ~pw )

svytotal( ~factor( stype ) , y )
confint( svytotal( ~factor( stype ) , y ) , df = degf( y ) )

svymean( ~factor( stype ) , y )
confint( svymean( ~factor( stype ) , y ) , df = degf( y ) )





use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop,
clear
gen pw = 6194/310
svyset [pweight=pw]
svy: tabulate stype, count se ci format (%12.3f)
svy: tabulate stype, percent se ci format (%12.3f)


----------------------------------------------------------
    stype |      count          se          lb          ub
----------+-----------------------------------------------
        E |  88334.428     710.843   86940.929   89727.927
        H |  15085.386     514.508   14076.773   16094.000
        M |  20340.296     582.814   19197.778   21482.814


--------------------------------------------------------------
    stype | percentages           se           lb           ub
----------+---------------------------------------------------
        E |      71.376        0.574       70.236       72.488
        H |      12.189        0.416       11.397       13.028
        M |      16.435        0.471       15.533       17.379




# # # # # # # #
# example 2 #
# # # # # # # #



# simple example of confidence intervals not matching between stata and R
# this example does match CIs for counts but does not match for percents

options(digits=10)
library(foreign)
library(survey)

# create a sample data frame with 100 records, containing values 1 - 5 and
weights 1 or 2
x <- data.frame( case = 1:100 , val = 1:5 , weight = 1:2 )
# immediately save it externally to be read into stata for comparison
write.dta( x , "c:/my directory/hundred record example.dta" )

# create a simple survey design, with weights only
y <- svydesign( ~1 , data = x , weights = ~weight )

# the confidence interval for the weighted counts do match stata
svytotal( ~I( val == 1 ) , y )
confint( svytotal( ~I( val == 1 ) , y ) , df = degf( y ) )

# generate the mean & standard error -- these two match precisely..
svymean( ~I( val == 1 ) , y )

# # # # # # # # # # #

# ..but none of the confidence interval attempts match
confint( svymean( ~I( val == 1 ) , y ) )
confint( svymean( ~I( val == 1 ) , y ) , df = degf( y ) )
svyciprop(~I( val == 1 ) , y, method="li" , df = degf( y ) )
svyciprop(~I( val == 1 ) , y, method="lo", df = degf( y ) )
svyciprop(~I( val == 1 ) , y, method="as", df = degf( y ) )
svyciprop(~I( val == 1 ) , y, method="be", df = degf( y ) )

# # # # # # # # # # #


**************************************

* run the R code first to create this dta file
use c:\my directory\hundred record example.dta , clear
svyset [pweight=weight]
svy: tabulate val, count se ci
svy: tabulate val, percent se ci

Number of strata   =         1                  Number of obs      =
100
Number of PSUs     =       100                  Population size    =
150
                                                Design df          =
99

----------------------------------------------------------
      val |      count          se          lb          ub
----------+-----------------------------------------------
        1 |         30       6.435       17.23       42.77
        2 |         30       6.435       17.23       42.77
        3 |         30       6.435       17.23       42.77
        4 |         30       6.435       17.23       42.77
        5 |         30       6.435       17.23       42.77


Number of strata   =         1                  Number of obs      =
100
Number of PSUs     =       100                  Population size    =
150
                                                Design df          =
99

--------------------------------------------------------------
      val | percentages           se           lb           ub
----------+---------------------------------------------------
        1 |          20        4.238        12.88        29.72
        2 |          20        4.238        12.88        29.72
        3 |          20        4.238        12.88        29.72
        4 |          20        4.238        12.88        29.72
        5 |          20        4.238        12.88        29.72





# # # # # # # #
# example 3 #
# # # # # # # #


# this example matches CIs for both

library(foreign)
library(survey)

x <- read.dta( "http://www.stata-press.com/data/r11/nhanes2f.dta"; )
y <- svydesign( ~1 , data = x , weights = ~finalwgt )
svymean( ~sex , y )
confint( svymean( ~sex , y ) , df = degf( y ) )

             mean     SE
sexMale   0.47958 0.0059
sexFemale 0.52042 0.0059
              2.5 %    97.5 %
sexMale   0.4681058 0.4910512
sexFemale 0.5089488 0.5318942

**************************************

use http://www.stata-press.com/data/r11/nhanes2f, clear
svyset [pweight=finalwgt]
svy: tabulate sex, percent se ci
(running tabulate on estimation sample)

Number of strata   =         1                 Number of obs      =
10337
Number of PSUs     =     10337                 Population size    =
117023659
                                               Design df          =
10336

--------------------------------------------------------------
1=male,   |
2=female  | percentages           se           lb           ub
----------+---------------------------------------------------
     Male |       47.96        .5853        46.81        49.11
   Female |       52.04        .5853        50.89        53.19






On Sun, Sep 23, 2012 at 4:30 PM, Thomas Lumley <tlum...@uw.edu> wrote:

> On Sat, Sep 22, 2012 at 2:51 AM, Anthony Damico <ajdam...@gmail.com>
> wrote:
>
> > Survey: Mean estimation
> >
> > Number of strata =       1          Number of obs    =     183
> > Number of PSUs   =      15          Population size  =  9235.4
> >                                     Design df        =      14
> >
> > --------------------------------------------------------------
> >              |             Linearized
> >              |       Mean   Std. Err.     [95% Conf. Interval]
> > -------------+------------------------------------------------
> >         ell0 |   .0218579   .0226225     -.0266624    .0703783
> > --------------------------------------------------------------
>
> This matches
>
> > svyciprop(~I(ell==0),dclus1,df=14,method="mean")
>                              2.5%   97.5%
> I(ell == 0)  0.0218579 -0.0266624 0.07038
>
> as does this
>
> > confint(svymean(~I(ell==0),dclus1),df=14)
>                           2.5 %        97.5 %
> I(ell == 0)FALSE  0.92962174505 1.02666240796
> I(ell == 0)TRUE  -0.02666240796 0.07037825495
>
> The df= argument is not explicitly documented in ?svyciprop, but it is
> in ?confint.svystat and ?svymean
>
> [I was slowed down a bit by the claim that the Stata intervals were
> asymmetric, but in fact they aren't]
>
>    -thomas
>
> --
> Thomas Lumley
> Professor of Biostatistics
> University of Auckland
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Exactly Replicating Stata's Survey Data Confidence Intervals in R

Reply via email to