On Oct 5, 2011, at 4:08 PM, Sam wrote:

Dear list,

I have very little experience in dealing with proportions; I am sure this is a very simple question,

Sometimes making that claim in a group of statisticians can provoke an extended discussion that reaches no eventual agreement. Discussions about testing proportions in groups with small numbers are one excellent example.

but I could find no suitable answer beyond doing a chi-squared test and then using the Marascuilo procedure as a post-hoc analysis.

I simply want to know whether the proportions (i.e., the Yes counts out of the totals) differ significantly between the cases and, if so, which cases are significantly high or low.


proportion <- structure(list(Case = structure(1:11, .Label = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"), class = "factor"), Yes = c(18L,
2L, 1L, 2L, 44L, 27L, 2L, 15L, 13L, 3L, 34L), No = c(171L, 11L,
5L, 8L, 146L, 80L, 5L, 30L, 22L, 5L, 42L), Num = c(189L, 13L,
6L, 10L, 190L, 107L, 7L, 45L, 35L, 8L, 76L)), .Names = c("Case",
"Yes", "No", "Num"), class = "data.frame", row.names = c(NA,
-11L))


I would probably have turned to logistic regression and used p.adjust to deal with the multiple-comparisons issue. By default, in a simple model with a single factor as the predictor, the p-value for the Intercept tests whether the reference level's log(odds) = 0, i.e., whether its proportion is 0.50 (since that is the proportion at which log(odds) = 0). The p-values for the other levels then test equality of their odds to the reference level's odds (the same as the odds ratios being 1, or the coefficients being 0). (The first way I set this up, your "A" level ended up as the reference, and the hypotheses being tested didn't seem reasonable or "interesting": namely, that the lowest proportion was equal to 50% and that all the rest of the proportions were equal to that lowest proportion.)
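
The logit-scale facts relied on here are easy to check: qlogis() is the log-odds transform and plogis() is its inverse.

```r
# log(odds) is 0 exactly when the proportion is 0.50
qlogis(0.5)  # returns [1] 0
plogis(0)    # returns [1] 0.5
# Under R's default treatment contrasts the first factor level is the
# reference, so the Intercept estimates that level's log(odds) and each
# other coefficient estimates a log odds ratio against it.
```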

as.matrix(proportion$Yes/proportion$Num)
           [,1]
 [1,] 0.0952381
 [2,] 0.1538462
 [3,] 0.1666667
 [4,] 0.2000000
 [5,] 0.2315789
 [6,] 0.2523364
 [7,] 0.2857143
 [8,] 0.3333333
 [9,] 0.3714286
[10,] 0.3750000
[11,] 0.4473684

So I took your data and made your "K" level the reference value, since its proportion (0.447) is the one closest to 50%, and then looked at the p-values for the others as information about their similarity to the K level.

proportion$Case <- relevel(proportion$Case, "K")
# glm's binomial response is cbind(successes, failures), i.e. Yes and No
casefit <- glm(cbind(Yes, No) ~ Case, data = proportion, family = binomial)
summary(casefit)
summary(casefit)$coefficients
p.adjust(summary(casefit)$coefficients[, 4])  # Holm correction by default

That set of hypotheses gives very few "significant" results, but most of your groups are pretty sparse, so that should be expected. Another (and I think better) approach is to use logistic regression with an offset term equal to the log(odds) for the entire group:

log(  with(proportion, sum(Yes)/sum(No) ) )   # returns [1] -1.181994
This corresponds to an expected proportion of 0.2346939
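
That conversion from the overall log(odds) back to a proportion is just plogis() (161 and 525 are sum(Yes) and sum(No) for these data):

```r
overall_lodds <- log(161/525)  # -1.181994, the offset value above
plogis(overall_lodds)          # returns [1] 0.2346939, i.e. 161/686
```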

casefit <- glm(cbind(Yes, No) ~ 0 + Case, data = proportion,
               offset = rep(log(with(proportion, sum(Yes)/sum(No))), 11), # each level gets the same offset
               family = binomial)

I used the 0+ on the RHS of the formula to force a labeled estimate for each level; with the offset in place, each p-value now tests that level's proportion against the overall proportion. Only two levels (A and K) reach "significant" p-values, and both are highly significant (due to the combination of larger numbers of "events" and greater "distance" from the null estimate), so adjusting them for multiple comparisons does not change the picture.

cbind( round( summary(casefit)$coefficients, 3),
       Prpns = round( (proportion$Yes/proportion$Num)[c(11, 1:10)], 3),
       N_Yes = proportion$Yes[c(11, 1:10)] )

      Estimate Std. Error z value Pr(>|z|) Prpns N_Yes
CaseK    0.971      0.231   4.208    0.000 0.447    34
CaseA   -1.069      0.248  -4.315    0.000 0.095    18
CaseB   -0.523      0.769  -0.680    0.496 0.154     2
CaseC   -0.427      1.095  -0.390    0.696 0.167     1
CaseD   -0.204      0.791  -0.258    0.796 0.200     2
CaseE   -0.017      0.172  -0.101    0.919 0.232    44
CaseF    0.096      0.223   0.430    0.667 0.252    27
CaseG    0.266      0.837   0.318    0.751 0.286     2
CaseH    0.489      0.316   1.546    0.122 0.333    15
CaseI    0.656      0.350   1.875    0.061 0.371    13
CaseJ    0.671      0.730   0.919    0.358 0.375     3

Comparing the counts and the observed proportions to the overall proportion (0.2346939) shows only two groups (A and K) with strong evidence of "being different." I think you need larger numbers.
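
To see how little power the sparse groups have, an exact test on one of them is instructive. A sketch using binom.test() against the pooled proportion (treating 161/686 as a known, fixed null value is itself a simplification):

```r
# Case G: 2 "Yes" out of 7 trials, tested against the pooled proportion.
# With n = 7 the exact test has essentially no power to detect moderate
# departures from 0.2347, so a non-significant p-value is expected here.
binom.test(x = 2, n = 7, p = 161/686)
```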

--
David Winsemius, MD
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
