Hi Bernardo,

Do you have to use logistic regression? If not, try Random Forests... It has worked for me in past situations where I had to analyze huge datasets.
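For example, something along these lines (an untested sketch; the data frame `d` and outcome column `y` are placeholder names, not objects from the thread):

    ## Fit a random forest to a binary outcome with many predictors and
    ## look at variable importance.  Assumes a data frame `d` whose column
    ## `y` is the binary outcome; all other columns are predictors.
    library(randomForest)

    d$y <- factor(d$y)                      # factor response => classification
    fit <- randomForest(y ~ ., data = d,
                        ntree = 500,        # default number of trees
                        importance = TRUE)  # keep permutation importance
    print(fit)                              # OOB error estimate
    varImpPlot(fit)                         # which variables matter most

    ## For very wide data (thousands of columns) the x/y interface,
    ## randomForest(x = d[, -1], y = d$y, ...), is faster than the formula
    ## interface.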
Some want to understand the DGP with a simple linear equation; others want high generalization power. It is your call... See, e.g.,
www.cis.upenn.edu/group/datamining/ReadingGroup/papers/breiman2001.pdf.
Maybe you are also interested in AD-HOC, an algorithm for feature selection:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.99.9130

Regards,
Pedro

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Liaw, Andy
Sent: Wednesday, October 01, 2008 12:01 PM
To: Frank E Harrell Jr; [EMAIL PROTECTED]
Cc: r-help@r-project.org
Subject: Re: [R] Logistic regression problem

From: Frank E Harrell Jr
>
> Bernardo Rangel Tura wrote:
> > On Tue, 2008-09-30 at 18:56 -0500, Frank E Harrell Jr wrote:
> >> Bernardo Rangel Tura wrote:
> >>> On Sat, 2008-09-27 at 10:51 -0700, milicic.marko wrote:
> >>>> I have a huge data set with thousands of variables and one binary
> >>>> variable. I know that most of the variables are correlated and are
> >>>> not good predictors... but...
> >>>>
> >>>> It is very hard to start modeling with such a huge dataset. What
> >>>> would be your suggestion? How do I make a first cut... how do I
> >>>> eliminate most of the variables without ignoring potential
> >>>> interactions... for example, maybe variable A is not a good
> >>>> predictor and variable B is not a good predictor either, but maybe
> >>>> A and B together are a good predictor...
> >>>>
> >>>> Any suggestion is welcome.
> >>>
> >>> milicic.marko,
> >>>
> >>> I think you could start with rpart(binary_variable ~ .).
> >>> This shows you a set of variables with which to start a model, and
> >>> starting cutoffs for the continuous variables.
> >>
> >> I cannot imagine a worse way to formulate a regression model. Reasons
> >> include:
> >>
> >> 1. Results of recursive partitioning are not trustworthy unless the
> >> sample size exceeds 50,000 or the signal-to-noise ratio is extremely
> >> high.
> >>
> >> 2. The type I error of tests from the final regression model will be
> >> extraordinarily inflated.
> >>
> >> 3. False interactions will appear in the model.
> >>
> >> 4. The cutoffs so chosen will not replicate and in effect assume that
> >> covariate effects are discontinuous and piecewise flat. The use of
> >> cutoffs results in a huge loss of information and power and makes the
> >> analysis arbitrary and impossible to interpret (e.g., a high covariate
> >> value : low covariate value odds ratio or mean difference is a complex
> >> function of all the covariate values in the sample).
> >>
> >> 5. The model will not validate in new data.
> >
> > Professor Frank,
> >
> > Thank you for your explanation.
> >
> > Well, if my first idea is wrong, what is your opinion of the following
> > approach?
> >
> > 1- Do a PCA on the data, excluding the binary variable.
> > 2- Put the principal components into a logistic model.
> > 3- Afterwards, map the principal components back to the original
> >    variables (only if that is of interest to milicic.marko).
> >
> > If this approach is wrong too, what is your approach?
>
> Hi Bernardo,
>
> If there is a large number of potential predictors and no previous
> knowledge to guide the modeling, principal components (PC) is often an
> excellent way to proceed. The first few PCs can be put into the model.
> The result is not always very interpretable, but you can "decode" the
> PCs by using stepwise regression or recursive partitioning (which are
> safer in this context because the stepwise methods are not exposed to
> the Y variable). You can also add PCs in a stepwise fashion in the
> pre-specified order of variance explained.
>
> There are many variations on this theme, including nonlinear principal
> components (e.g., the transcan function in the Hmisc package), which may
> explain more variance of the predictors.
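To make that concrete, a minimal (untested) sketch of the PC-then-logistic idea described above, assuming a data frame `d` with a binary outcome `y` and only numeric predictors (all object names and the choice of 5 components are placeholders):

    ## Principal components followed by logistic regression on the first
    ## few components.
    X   <- scale(d[, setdiff(names(d), "y")])   # centre and scale predictors
    pc  <- prcomp(X)                            # principal components
    k   <- 5                                    # keep the first few PCs
    dat <- data.frame(y = d$y, pc$x[, 1:k])     # scores for the retained PCs

    fit <- glm(y ~ ., data = dat, family = binomial)
    summary(fit)

    ## To "decode" the retained PCs, inspect their loadings on the
    ## original variables:
    round(pc$rotation[, 1:k], 2)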
While I agree with much of what Frank said, I'd like to add some points.

Variable selection is a treacherous business whether one is interested in prediction or inference. If the goal is inference, Frank's book is a must read, IMHO. (It's great for predictive model building, too.)

If interactions are of high interest, principal components are not going to give you that.

Regarding cutpoint selection: the machine learners have found that the "optimal" split point for a continuous predictor in tree algorithms is extremely variable, so that interpreting it would be risky at best. Breiman essentially gave up on interpreting a single tree when he went to random forests.

Best,
Andy

> Frank
>
> --
> Frank E Harrell Jr   Professor and Chair           School of Medicine
>                      Department of Biostatistics   Vanderbilt University

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.