[R-sig-eco] Regression with few observations per factor level
Hi, I would like to test the impact of a treatment on some variable using regression (e.g. lm(var ~ trt + cov)). However, I only have four observations per factor level. Is it still possible to apply a regression with such a small sample size? I think it could be difficult to estimate the variance correctly. Do you think I should instead compute a non-parametric test such as Kruskal-Wallis? However, I need to include covariates in my models, and I am not sure whether basic non-parametric tests are suitable for this. Thanks for any suggestions.

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
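[Editor's note: a minimal sketch of the Kruskal-Wallis option mentioned above, written in Python with scipy as a stand-in for the R call. The group values are made-up illustrative numbers, not data from this thread; the point is that the test compares groups but has no slot for a covariate, unlike lm(var ~ trt + cov).]

```python
# Kruskal-Wallis compares group distributions by ranks; note there is
# nowhere to supply a covariate, which is the poster's concern.
from scipy import stats

# Hypothetical data: four observations per factor level, three levels.
trt1 = [3.1, 2.8, 3.5, 2.9]
trt2 = [1.2, 0.9, 1.4, 1.1]
trt3 = [11.0, 10.5, 11.8, 10.9]

stat, p = stats.kruskal(trt1, trt2, trt3)
print(stat, p)  # H statistic and p-value; no covariate adjustment possible
```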
Re: [R-sig-eco] Regression with few observations per factor level
I think you can, but the confidence intervals will be rather large due to the small number of samples. Notice how the standard errors change as the sample size per group goes from 4 to 30.

> pg <- 4  # pg = per group
> my.df <- data.frame(var = c(rnorm(pg, mean = 3), rnorm(pg, mean = 1),
+                             rnorm(pg, mean = 11), rnorm(pg, mean = 30)),
+                     trt = rep(c("trt1", "trt2", "trt3", "trt4"), each = pg),
+                     cov = runif(pg * 4))  # 4 groups
> summary(lm(var ~ trt + cov, data = my.df))

Call:
lm(formula = var ~ trt + cov, data = my.df)

Residuals:
     Min       1Q   Median       3Q      Max
-1.63861 -0.46080  0.03332  0.66380  1.27974

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.2345     1.0218   1.208    0.252
trttrt2      -0.7759     0.8667  -0.895    0.390
trttrt3       7.8503     0.8308   9.449  1.3e-06 ***
trttrt4      28.2685     0.9050  31.236  4.3e-12 ***
cov           1.4027     1.1639   1.205    0.253
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.154 on 11 degrees of freedom
Multiple R-squared:  0.9932, Adjusted R-squared:  0.9908
F-statistic: 404.4 on 4 and 11 DF,  p-value: 7.467e-12

> pg <- 30  # pg = per group
> my.df <- data.frame(var = c(rnorm(pg, mean = 3), rnorm(pg, mean = 1),
+                             rnorm(pg, mean = 11), rnorm(pg, mean = 30)),
+                     trt = rep(c("trt1", "trt2", "trt3", "trt4"), each = pg),
+                     cov = runif(pg * 4))  # 4 groups
> summary(lm(var ~ trt + cov, data = my.df))

Call:
lm(formula = var ~ trt + cov, data = my.df)

Residuals:
    Min      1Q  Median      3Q     Max
-2.5778 -0.6584 -0.0185  0.6423  3.2077

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.76961    0.25232  10.977  < 2e-16 ***
trttrt2     -1.75490    0.28546  -6.148 1.17e-08 ***
trttrt3      8.40521    0.28251  29.752  < 2e-16 ***
trttrt4     27.04095    0.28286  95.599  < 2e-16 ***
cov          0.05129    0.32523   0.158    0.875
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.094 on 115 degrees of freedom
Multiple R-squared:  0.9913, Adjusted R-squared:  0.991
F-statistic: 3269 on 4 and 115 DF,  p-value: < 2.2e-16

On Mon, Oct 20, 2014 at 10:53 AM, V. Coudrain wrote:
> [original question quoted; snipped]

--
In God we trust, all others bring data.
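[Editor's note: the shrinking standard errors in the simulation above follow directly from the sigma/sqrt(n) rule. A minimal Python check of that arithmetic, with sigma = 1 as in the rnorm() simulation:]

```python
# The standard error of a group mean is sigma / sqrt(n), so going from
# n = 4 to n = 30 per group shrinks it by sqrt(30/4) ~ 2.7, consistent
# with the change in the lm() standard errors above.
import math

sigma = 1.0                       # residual SD used in the simulation
se4 = sigma / math.sqrt(4)        # 0.5
se30 = sigma / math.sqrt(30)      # ~0.183
print(se4, se30, se4 / se30)      # ratio ~2.74
```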
Re: [R-sig-eco] Regression with few observations per factor level
Thank you very much. If I understand it correctly, the confidence intervals get wider, my test has less power, and the probability of detecting a significant relationship decreases. What about the significant coefficients: are they reliable?

> Message of 20/10/14, 11:30
> From: "Roman Luštrik"
> To: "V. Coudrain"
> Cc: "r-sig-ecology@r-project.org"
> Subject: Re: [R-sig-eco] Regression with few observations per factor level
>
> [quoted message snipped]
Re: [R-sig-eco] Regression with few observations per factor level
Hi,

coefficients and their p-values are reliable if your data are OK and you know enough about the process that generated them to choose an appropriate model. With 4 points per level, it may be really difficult to identify a bad fit or outliers.

For example: simple linear regression needs constant variance of the normal distribution from which residuals are drawn, along the regression line, to work properly. With 4 points you can hardly estimate this, but if you know enough about the process that generated the data, you are safe. If you do not, it is not easy to say anything about the nature of that process.

If you know (or can assume) that there is a simple linear relationship, you can say "the slope of this relationship is such and such"; but if you want to estimate both the nature of the relationship ("A *linearly* depends on B") and its magnitude ("the slope of this relationship is ..."), p-values will not help you much.

Of course, I may be wrong; I am not a statistician, just a user.

Best,
Martin W.

V. Coudrain wrote on Mon, 20 Oct 2014, 13:37 +0200:
> [quoted message snipped]
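[Editor's note: Martin's warning about estimating the residual variance from 4 points can be quantified. Assuming normal errors, the variance estimate satisfies s² ~ sigma² · chi²_df / df, so its 95% range is enormous at 3 residual degrees of freedom and tight at 115. A Python sketch of that comparison:]

```python
# Central 95% interval for s^2 / sigma^2 as a function of residual df,
# illustrating how poorly 4 points pin down the error variance.
from scipy.stats import chi2

def s2_interval(df):
    """Central 95% interval for s^2/sigma^2 with df residual degrees of freedom."""
    return chi2.ppf(0.025, df) / df, chi2.ppf(0.975, df) / df

lo3, hi3 = s2_interval(3)        # ~3 residual df, as with 4 points per line
lo115, hi115 = s2_interval(115)  # 115 df, as in the n = 30-per-group fit
print(lo3, hi3)       # roughly (0.07, 3.1): off by a large factor either way
print(lo115, hi115)   # roughly (0.76, 1.27): much tighter
```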
[R-sig-eco] Logistic regression with 2 categorical predictors
Hi Listers,

I am trying to run a logistic regression to look at the effects of experiment type and age on the behavior of fish in a choice-chamber experiment. I am using the glm approach and would like some advice on how, or whether, to perform contrasts to work out which levels of Factor 1 (age) and Factor 2 (test) are significantly different from each other. I have not been able to clarify from my reading what the appropriate approach is when dealing with a significant interaction term. I am also not sure how one interprets a model when all the coefficients are non-significant but the chi-square ANOVA shows a highly significant interaction term. I have graphed the data as dot plots and there is definitely evidence of changes in proportions at later ages. I want to provide evidence for when, and for which tests, there was a 'significant' change in behavior.

> snapper2
   age test prefer avoid
1    1   LR     15    14
2    1   SD     15    13
3    1   SG     17    14
4    1   SW     14    14
5    2   LR     17    14
6    2   SD     16    19
7    2   SG     20    10
8    2   SW     15    21
9    3   LR     10    16
10   3   SD     14    10
11   3   SG     14     9
12   3   SW     13    15
13   4   LR     12    11
14   4   SD     14    11
15   4   SG     13    12
16   4   SW     11    14
17   5   LR      4    12
18   5   SD      8     8
19   5   SG      0    18
20   5   SW     10     6
21   6   LR      0     6
22   6   SD      3     4
23   6   SG      0     5
24   6   SW      5     3
> dotplot(age ~ prefer/avoid, data = snapper2, group = snapper2$test,
+         cex = 1.5, pch = 19, ylab = "age",
+         auto.key = list(space = "right", title = "Tests"))
> out2 <- glm(cbind(prefer, avoid) ~ age*test, data = snapper2, family = binomial)
> summary(out2)

Call:
glm(formula = cbind(prefer, avoid) ~ age * test, family = binomial,
    data = snapper2)

Deviance Residuals:
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  6.899e-02  3.716e-01   0.186   0.8527
age2         1.252e-01  5.180e-01   0.242   0.8091
age3        -5.390e-01  5.483e-01  -0.983   0.3256
age4         1.802e-02  5.589e-01   0.032   0.9743
age5        -1.168e+00  6.866e-01  -1.701   0.0890 .
age6        -2.575e+01  9.348e+04   0.000   0.9998
testSD       7.411e-02  5.307e-01   0.140   0.8890
testSG       1.252e-01  5.180e-01   0.242   0.8091
testSW      -6.899e-02  5.301e-01  -0.130   0.8964
age2:testSD -4.401e-01  7.260e-01  -0.606   0.5444
age3:testSD  7.324e-01  7.846e-01   0.933   0.3506
age4:testSD  8.004e-02  7.863e-01   0.102   0.9189
age5:testSD  1.024e+00  9.301e-01   1.102   0.2707
age6:testSD  2.532e+01  9.348e+04   0.000   0.9998
age2:testSG  3.738e-01  7.407e-01   0.505   0.6138
age3:testSG  7.867e-01  7.832e-01   1.004   0.3152
age4:testSG -1.321e-01  7.764e-01  -0.170   0.8649
age5:testSG -2.568e+01  8.768e+04   0.000   0.9998
age6:testSG  2.121e-02  1.334e+05   0.000   1.0000
age2:testSW -4.616e-01  7.249e-01  -0.637   0.5242
age3:testSW  3.959e-01  7.662e-01   0.517   0.6054
age4:testSW -2.592e-01  7.858e-01  -0.330   0.7415
age5:testSW  1.678e+00  9.386e-01   1.788   0.0737 .
age6:testSW  2.626e+01  9.348e+04   0.000   0.9998
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5.4908e+01  on 23  degrees of freedom
Residual deviance: 2.6113e-10  on  0  degrees of freedom
AIC: 122.73

Number of Fisher Scoring iterations: 23

> anova(out2, test = "Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: cbind(prefer, avoid)

Terms added sequentially (first to last)

         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
NULL                        23     54.908
age       5   11.235        18     43.673 0.0469115 *
test      3    1.593        15     42.079 0.6608887
age:test 15   42.079         0      0.000 0.0002185 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

cheers
Andy
Re: [R-sig-eco] Regression with few observations per factor level
You are more or less performing an ANOVA/ANCOVA on your data? As pointed out earlier, all of the normal-theory regression assumptions apply. Assuming all of those things are satisfied, then even if you have large confidence intervals, where there are significant differences between groups I don't see why you couldn't correctly infer something about the treatments. Maybe I am missing something.

Stephen

On Mon, Oct 20, 2014 at 8:43 AM, Martin Weiser wrote:
> [quoted message snipped]
Re: [R-sig-eco] Logistic regression with 2 categorical predictors
Dear Andrew,

anova() and summary() test different hypotheses. anova() tests whether at least one level is different from the others; summary() tests whether a coefficient is different from zero. Multiple comparison of the different interaction levels is probably the most relevant in this case. The easiest way is to make a new variable:

snapper2$inter <- with(snapper2, interaction(age, test))
model <- glm(cbind(prefer, avoid) ~ 0 + inter, data = snapper2, family = binomial)
library(multcomp)
mc <- glht(model, linfct = mcp(inter = "Tukey"))
summary(mc)

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium
+ 32 2 525 02 51
+ 32 54 43 61 85
thierry.onkel...@inbo.be
www.inbo.be

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey

-----Original Message-----
From: r-sig-ecology-boun...@r-project.org [mailto:r-sig-ecology-boun...@r-project.org] On behalf of Andrew Halford
Sent: Monday, 20 October 2014 16:06
To: r-sig-ecology@r-project.org
Subject: [R-sig-eco] Logistic regression with 2 categorical predictors

[quoted message snipped]
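[Editor's note: a Python stand-in for the pairwise-comparison idea above, using Fisher's exact test with a Bonferroni correction instead of multcomp's glht(). The two cells compared are taken from the snapper2 table (SG test at ages 1 and 5); the choice of 15 comparisons, C(6,2) age pairs within one test, is an illustrative assumption.]

```python
# Pairwise comparison of prefer/avoid proportions for two cells of the
# snapper2 data, with a simple Bonferroni adjustment for multiplicity.
from scipy.stats import fisher_exact

sg_age1 = (17, 14)   # SG test, age 1: prefer, avoid
sg_age5 = (0, 18)    # SG test, age 5: prefer, avoid

odds, p = fisher_exact([sg_age1, sg_age5])
n_comparisons = 15                  # e.g. all C(6,2) age pairs within one test
p_adj = min(1.0, p * n_comparisons)
print(p, p_adj)                     # raw and Bonferroni-adjusted p-values
```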
Re: [R-sig-eco] Regression with few observations per factor level
Thank you for this helpful thought. So if I get it correctly, it is hopeless to try testing an interaction, but we may nevertheless assess whether a covariate has an impact, provided its effect is the same in all treatments.

> Message of 20/10/14, 16:46
> From: "Elgin Perry"
> To: v_coudr...@voila.fr
> Subject: Regression with few observations per factor level
>
> If it is reasonable to assume that the slope of the covariate is the same for
> all treatments and you have numerous treatments, then you can do this by
> specifying one slope parameter for all treatments, as you gave in your example
> (e.g. lm(var ~ trt + cov)). By combining slope information over treatments,
> you can obtain a reasonably precise estimate. With so few observations per
> treatment, you will not be able to estimate separate slopes for each
> treatment with any degree of precision (e.g. lm(var ~ trt + trt:cov)).

Elgin S. Perry, Ph.D.
Statistics Consultant
377 Resolutions Rd.
Colonial Beach, Va. 22443
ph. 410.610.1473

[original question quoted; snipped]
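[Editor's note: Elgin's pooled-slope argument can be illustrated numerically. The Python simulation below (entirely made-up data; it ignores the treatment intercepts for simplicity) compares the average estimated slope standard error when the slope is fitted to all 16 points versus to a single 4-point group.]

```python
# Slope SE from a straight-line fit shrinks sharply as points accumulate:
# pooling one slope over 4 groups of 4 beats fitting a slope per group.
import numpy as np

rng = np.random.default_rng(42)

def slope_se(n):
    """One simulated fit of y = a + b*x with n points; returns the
    estimated standard error of the slope b."""
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 1.0, n)
    xc = x - x.mean()
    b = (xc * y).sum() / (xc ** 2).sum()
    a = y.mean() - b * x.mean()
    resid = y - (a + b * x)
    s2 = (resid ** 2).sum() / (n - 2)       # residual variance on n - 2 df
    return float(np.sqrt(s2 / (xc ** 2).sum()))

se_pooled = np.mean([slope_se(16) for _ in range(500)])  # one common slope
se_single = np.mean([slope_se(4) for _ in range(500)])   # per-group slope
print(se_pooled, se_single)  # the 4-point slope SE is several times larger
```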
Re: [R-sig-eco] Regression with few observations per factor level
Yes, but I fear the residuals behave badly as soon as the model gets a little more complex (e.g., with two covariates or an interaction). The scope for performing an ANCOVA is thus very limited. That's why I was thinking about a potential non-parametric model. But I do not want to artificially make my data tell something if they cannot.

> Message of 20/10/14, 16:50
> From: "stephen sefick"
> To: "Martin Weiser"
> Cc: "V. Coudrain", "r-sig-ecology"
> Subject: Re: [R-sig-eco] Regression with few observations per factor level
>
> [quoted message snipped]
Re: [R-sig-eco] Regression with few observations per factor level
Yes, the analysis with a small sample size would be valid (under the assumption that the model, both fixed and random effects, is correctly specified), but at some point there must be a practical assessment of the desired precision and of the costs of inadequate estimates or of wrongly accepting or rejecting hypotheses. If it were just about the numbers from a sample and the resulting p-values, we would only need statisticians and no subject-matter experts (which is clearly not the case).

And while I'm soapboxing: situations with low variability require fewer samples than situations with high variability. One cannot assess the adequacy of an analysis solely from the sample size.

Jim

Jim Baldwin
Station Statistician
Pacific Southwest Research Station
USDA Forest Service

-----Original Message-----
From: V. Coudrain
Sent: Monday, October 20, 2014 8:54 AM
To: Elgin Perry
Cc: r-sig-ecology@r-project.org
Subject: Re: [R-sig-eco] Regression with few observations per factor level

[quoted message snipped]
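[Editor's note: Jim's point that low variability needs fewer samples follows from the textbook normal-approximation sample-size formula for a two-sample comparison, n per group ≈ 2·(z_{α/2} + z_β)²·(σ/δ)². A Python sketch with assumed values α = 0.05, power = 0.80, and effect size δ = 1:]

```python
# Required n per group scales with (sigma/delta)^2: quadrupling the SD
# multiplies the required sample size by 16.
import math
from scipy.stats import norm

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Normal-approximation n per group for a two-sample mean comparison."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(2 * (z * sigma / delta) ** 2)

print(n_per_group(sigma=0.5, delta=1.0))  # low variability: 4 per group
print(n_per_group(sigma=2.0, delta=1.0))  # high variability: 63 per group
```

With σ/δ = 0.5 even four observations per group can suffice, which is exactly why sample size alone cannot settle whether an analysis is adequate.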