[R] logistic regression in an incomplete dataset
Dear all,

I want to do a logistic regression. So far I've only found out how to do that in R on a dataset of complete cases. I'd like to do the logistic regression via maximum likelihood, using all the study cases (complete and incomplete). Can you help?

I'm using glm() with family=binomial(logit). If any covariate in a study case is missing, then the study case is dropped, i.e. it is doing a complete-cases analysis. As a lot of study cases are being dropped, I'd rather it did maximum likelihood using all the study cases. I tried setting glm()'s na.action to NULL, but then it complained about NAs present in the study cases.

I have about 1000 unmatched study cases and fewer than 10 covariates, so I could use unconditional ML estimation (as opposed to conditional ML estimation).

regards
Desmond

--
Desmond Campbell
UCL Genetics Institute
d.campb...@ucl.ac.uk
Tel. ext. 020 31084006, int. 54006
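A minimal sketch of the behaviour described above, using an invented data frame (dat, with outcome y and covariates x1, x2, none of which come from the thread): glm()'s default na.action, na.omit, silently drops incomplete rows, while na.action = NULL lets the NAs reach the fitting code and the fit fails.

```r
## Hypothetical data: outcome y (0/1) and covariates x1, x2 with some NAs.
set.seed(1)
dat <- data.frame(
  y  = rbinom(1000, 1, 0.3),
  x1 = rnorm(1000),
  x2 = rnorm(1000)
)
dat$x1[sample(1000, 200)] <- NA     # introduce missing covariate values

## Default behaviour (na.action = na.omit): complete-cases analysis.
fit_cc <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)
nobs(fit_cc)                        # noticeably fewer than 1000 rows used

## With na.action = NULL the NAs are passed through to the fitting code,
## which stops with an error about missing values, as described above.
## fit_err <- glm(y ~ x1 + x2, family = binomial(logit), data = dat,
##                na.action = NULL)
```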
Re: [R] logistic regression in an incomplete dataset
Hi Bert,

Thanks for your reply. I AM making an assumption of MAR data, because:
- informative missingness (I assume you mean NMAR) is too hard to deal with;
- I have quite a few covariates (so the observed data are likely to predict the missing values and mitigate against informative missingness);
- the missingness is not supposed to be censoring: I doubt the missingness on the covariates (mostly environmental-type measures) is censoring with respect to the independent variables, which are genotypes;
- I don't like complete-case logistic regression because it is less robust and throws away information.

However I don't have time to do anything clever, so I'm just going to go along with the complete-case logistic regression. Thanks again.

regards
Desmond

Bert Gunter wrote:

Desmond: The problem with ML with missing data is both the M and the L. Under MAR, the L factors into a part involving the missingness parameters and a part involving the model parameters, and you can maximize the model-parameters part without having to worry about missingness, because it depends only on the observed data. (MCAR is even easier, since missingness doesn't change the likelihood.)

For informative missingness you have to come up with an L to maximize, and this is hard. There's also no way of checking the adequacy of the L (since the data to check it are missing). And once you have chosen your L, the M may be hard to do numerically. As Emmanuel indicated, Bayes may help, but now I'm at the end of MY knowledge.

Note that in many cases, "missing" is actually not missing -- it's censoring. And for that, likelihoods can be obtained (and maximized).

Cheers,
Bert Gunter
Genentech Nonclinical Biostatistics

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of Desmond D Campbell
Sent: Monday, April 05, 2010 3:19 PM
To: Emmanuel Charpentier
Cc: r-help@r-project.org; Desmond Campbell
Subject: Re: [R] logistic regression in an incomplete dataset

Dear Emmanuel,

Thank you. Yes, I broadly agree with what you say. I think ML is a better strategy than complete case, because I think its estimates will be more robust than complete case's. For unbiased estimates I think ML requires that the data are MAR, while complete case requires that the data are MCAR. Anyway, I would have thought ML could be done without resorting to multiple imputation, but I'm at the edge of my knowledge here.

Thanks once again,
regards
Desmond

From: Emmanuel Charpentier bacbuc.dyndns.org>
Subject: Re: logistic regression in an incomplete dataset
Newsgroups: gmane.comp.lang.r.general
Date: 2010-04-05 19:58:20 GMT

Dear Desmond,

a somewhat analogous question was posed recently (about 2 weeks ago) on the sig-mixed-model list, and I tried (in two posts) to give some elements of information (and some bibliographic pointers). To summarize tersely:

- a model of missingness (i.e. *why* are some data missing?) is necessary to choose the right measures to take. Two special cases (Missing At Random and Missing Completely At Random) allow for (semi-)automated compensation. See the literature for further details.
- complete-case analysis may give seriously weakened and *biased* results. Pairwise-complete-case analysis is usually *worse*.
- simple imputation leads to underestimated variances and may also give biased results.
- multiple imputation is currently thought of as a good way to alleviate missing data if you have a missingness model (or can honestly bet on MCAR or MAR), and if you properly combine the results of your imputations.
- a few missing-data packages exist in R to handle this case. My personal selection at this point would be mice, mi, Amelia, and possibly mitools, but none of them is fully satisfying (in particular, accounting for a random effect needs special handling all the way in all packages...).
- an interesting alternative is to write a full probability model (in BUGS for example) and use Bayesian estimation; in this framework, missing data are "naturally" modelled in the model used for the analysis. However, this might entail *large* work, be difficult, and not always succeed (numerical difficulties). Furthermore, the results of a Bayesian analysis might not be what you seek...

HTH,

Emmanuel Charpentier
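A hedged sketch of the multiple-imputation route Emmanuel describes, using the mice package; the data frame dat and the formula y ~ x1 + x2 are placeholders rather than anything from the thread:

```r
## Multiple imputation with mice, then a pooled logistic regression.
## `dat` is a hypothetical data frame with outcome y and covariates x1, x2.
library(mice)

imp    <- mice(dat, m = 5, seed = 123)                    # 5 imputed datasets
fits   <- with(imp, glm(y ~ x1 + x2, family = binomial))  # fit glm on each one
pooled <- pool(fits)                                       # combine by Rubin's rules
summary(pooled)
```

The other packages mentioned (mi, Amelia, mitools) follow the same impute / analyse / combine pattern, differing mainly in the imputation model and the pooling helpers.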
Re: [R] logistic regression in an incomplete dataset
Dear Thomas,

Thanks for your reply. Yes, you are quite right (your example): complete case does not require MCAR; however, as well as being a bit less robust than ML, it throws away data. Missing Data in Clinical Studies by Geert Molenberghs and Michael Kenward has a nice section in chapter 3 or 4 where they rubbish complete-case analysis and Last Observation Carried Forward. Ah well, I don't have time to do anything clever, so I'm just going to go along with the complete-case logistic regression.

regards
Desmond

Thomas Lumley wrote:

On Mon, 5 Apr 2010, Desmond D Campbell wrote: "For unbiased estimates I think ML requires that the data are MAR, while complete case requires that the data are MCAR. Anyway, I would have thought ML could be done without resorting to multiple imputation, but I'm at the edge of my knowledge here."

This is an illustration of why Rubin's hierarchy, while useful, doesn't displace actual thinking about the problem. The maximum-likelihood problem for which the MAR assumption is sufficient involves specifying the joint likelihood for the outcome and all predictor variables, which is basically the same problem as multiple imputation. Multiple imputation averages the estimate over the distribution of the unknown values; maximum likelihood integrates out the unknown values, but for reasonably large sample sizes the estimates will be equivalent (by asymptotic linearity of the estimator). Standard-error calculation is probably easier with multiple imputation.

Also, it is certainly not true that a complete-case regression analysis requires MCAR. For example, if the missingness is independent of Y given X, the complete-case distribution will have the same mean of Y given X as the population, and so will have the same best-fitting regression. This is a stronger assumption than you need for multiple imputation, but not a lot stronger.

-thomas
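A small simulation sketch of Thomas's point that complete-case logistic regression stays essentially unbiased when missingness depends only on X and not on Y; the data-generating model and variable names are invented for illustration:

```r
## Simulate logistic data, then delete x-values with a probability that
## depends only on x (not on y). The complete-case glm fit, which drops
## the NA rows via the default na.omit, should recover the true coefficients.
set.seed(42)
n <- 1e5
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.5 * x))   # true coefficients: -1 and 0.5

x_obs <- x
miss  <- rbinom(n, 1, plogis(x))          # missingness depends on x only
x_obs[miss == 1] <- NA

fit <- glm(y ~ x_obs, family = binomial)  # complete-case analysis
coef(fit)                                 # close to c(-1, 0.5)
```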
[R] test for whether dataset comes from a known MVN
Dear Ben Bolker,

Thanks for replying and offering advice; unfortunately it doesn't solve my problem.

1) mshapiro.test() in the mvnormtest package appears only applicable to datasets containing 3 to 5000 samples, whereas my dataset contains 100,000 samples.
2) As you said in your email, if my data were from the real world then any test would be likely to reject the null hypothesis, because of the power of such a large dataset. However my data are not from the real world. I am conducting validation studies, and if the program I am testing is working correctly then the dataset will be perfectly normally distributed.

Thanks anyway.

regards
Desmond Campbell

> Campbell, Desmond wrote:
> > Dear all,
> > I have a multivariate dataset containing 100,000 or more points.
> > I want to find the p-value for the dataset of points coming from a
> > particular multivariate normal distribution with
> > mean vector u
> > covariance matrix s2
> > So
> > H0: points ~ MVN(u, s2)
> > H1: points not ~ MVN(u, s2)
> > How do I find the p-value in R?
>
> Ben Bolker wrote:
> > Googling for "Shapiro-Wilk multivariate" brings up mshapiro.test()
> > in the mvnormtest package. However, I would strongly suspect that
> > if your data are from the real world you will reject the null hypothesis
> > of multivariate normality when you have 100,000 points -- the power
> > to detect tiny (unimportant?) deviations from MVN will be very high.
> >
> > cheers
> > Ben Bolker
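Since the null hypothesis here fully specifies the distribution (u and s2 are known rather than estimated), one simple check that scales to 100,000 points is to compare the squared Mahalanobis distances against their theoretical chi-squared distribution. This is only a sketch of that idea, not something proposed in the thread; u, s2, and the data matrix X below are placeholders, and the check covers only the radial part of MVN(u, s2), not every possible departure:

```r
## Partial goodness-of-fit check against a fully specified MVN(u, s2).
## X is an n x p data matrix; u and s2 are the known mean and covariance.
set.seed(1)
p  <- 3
u  <- rep(0, p)
s2 <- diag(p)
X  <- MASS::mvrnorm(1e5, mu = u, Sigma = s2)   # stand-in for the validation data

## If X ~ MVN(u, s2), the squared Mahalanobis distances are chi-squared(p).
d2 <- mahalanobis(X, center = u, cov = s2)
ks.test(d2, "pchisq", df = p)   # valid here because the null distribution is
                                # fully specified (no parameters estimated)
```

Supplementary checks on each margin, or on a few linear combinations of the columns, can catch non-radial departures that this single statistic would miss.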