Hi all.  Sorry for the long email.  I have been trying to find someone local 
to work on this with me, without much luck.  I went in to our local stats 
consulting service here, and the guy there told me that I already know more 
about model selection than he does.  :-<  He pointed me towards another 
professor that can perhaps help, but that prof is busy until mid-June, so I 
want to get as much figured out as I can before I eventually meet with him.  
And that prof may turn out not to be much help anyway, in which case all I've 
got is you folks.  :->

  So I've got a big dataset (300,000 rows) in which the dependent variable is 
binary, and there are five independent variables, each continuous, and each 
uniformly distributed between 0 and 1.  I'm trying to use binary logistic 
regression to get an explanatory model with decent predictive value (but being 
simple, for explanation, is more important than being optimally predictive).  
Considering the squares of those variables as well (but no higher powers; this 
list recently talked me out of those because of collinearity problems with 
polynomial fits), plus many interactions among both the variables and their 
squares, I have a full model formula of 116
terms; but most of those terms are of small effect (although often still 
significant in a glm fit, since 300,000 rows is a lot of data).  So I have a 
model selection problem: I want to find a simple explanatory model containing 
just the terms of large effect that are most important in getting an 
understanding of what strongly influences the dependent variable.
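
  In case the shape of the model helps below, the full formula is built roughly 
along these lines (a sketch only: Cx1..Cx5 are placeholder names for my 
centered-and-scaled predictors, discussed further down, and this does not 
reproduce my exact 116-term formula):

f <- outcome ~ (Cx1 + Cx1_sq + Cx2 + Cx2_sq + Cx3 + Cx3_sq +
                Cx4 + Cx4_sq + Cx5 + Cx5_sq)^2    # illustrative shape only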

  I was previously trying to do model selection using a step-down approach with 
a large per-term penalty (much larger than the standard BIC penalty) to force 
more terms to drop out.  I looked at a wide range of penalty values, got a 
corresponding set of models with different numbers of terms, looked at the 
correct prediction rate for each of those models, and basically chose the 
simplest model that still gave me a pretty good prediction rate (for some 
subjective definition of "pretty good").  Typically the model I chose would 
have about 20 terms, out of the original 116.  (A standard BIC step-down would 
retain more like 100 terms, with only a very slightly better prediction rate.)
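
  For concreteness, the step-down was along these lines (a sketch; penalty 
stands in for whichever per-term penalty value I was testing, and log(nrow(df)) 
would give the standard BIC penalty):

full_fit <- glm(f, data=df, family="binomial")    # full 116-term fit
step_fit <- step(full_fit, direction="backward", k=penalty, trace=FALSE)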

  So that was working OK, but in discussions with the folks on this list 
(thanks everybody for your help!), I have been exploring using the lasso for 
this instead, to avoid the problems with step-down, gain the benefits of 
shrinkage, and so forth; clearly it should be much better than my homegrown 
model selection procedure.  I've been reading about the lasso in Tibshirani 
(1996) and in The Elements of Statistical Learning.  I'm using glmnet() at 
present; I have seen that lars() also exists, but I don't understand its 
documentation as well, so I'm starting with glmnet.  So there's one question:

1. Is my choice of glmnet() ok?  On what basis should I choose glmnet() vs. 
lars()?

  The lasso wants variables to be centered and scaled, I gather.  glmnet() can 
do this for me, but I want to understand exactly which variables the fit is 
done on so that I can interpret the coefficients properly, and so I'd rather do 
the scaling myself.  (I'm also concerned that glmnet() might not do it 
correctly, since it doesn't know that some of the terms in the formula are 
squares and interactions.)  So I'm passing standardize=FALSE to glmnet(), and 
doing my own scaling beforehand, like this (where df is my dataframe):

df$Cx    <- scale(df$x)   # an independent variable, centered and scaled
df$Cx_sq <- df$Cx ^ 2     # the square of that (already-scaled) variable

  I do this for each of the five independent variables.  So the variables 
themselves are centered and scaled, while the squared versions of the variables 
are exactly the square of the scaled variable; I do not scale them again.  So 
question 2:

2. Is the way I'm scaling the variables before calling glmnet() correct?  Or 
should the squares themselves be centered and scaled?
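
  Concretely, I do something like this for each of the five variables (a 
sketch; x1..x5 stand in for my actual predictor names):

for (v in c("x1", "x2", "x3", "x4", "x5")) {
    df[[paste0("C", v)]] <- as.numeric(scale(df[[v]]))     # centered and scaled
    df[[paste0("C", v, "_sq")]] <- df[[paste0("C", v)]]^2  # square; not re-scaled
}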

  Having scaled the variables in this way, I then construct a model matrix and 
call glmnet() (where f is the 116-term formula and df is my dataframe):

library(glmnet)
mf <- model.frame(f, df)                   # build the model frame from the 116-term formula f
mm <- model.matrix(formula(f), mf)[,-1]    # design matrix; drop the intercept column
lasso <- glmnet(mm, y=df$outcome, family="binomial", standardize=FALSE)

  I do it this way because glmnet() doesn't support being passed a formula and 
a dataframe.  I think this is doing the right thing.  The model.matrix() call 
constructs new columns for all of the interactions in the formula, which of 
course act as separate independent variables in the regression.  One worry I 
have is that those, like the squares discussed above, are not themselves 
centered and scaled.  If there's an interaction between, say, Cx and Cy, then 
the model matrix column for Cx:Cy is of course just the product of the Cx 
column and the Cy column, and so it is not centered/scaled.  I don't know if 
this is correct or not.  So question 3:

3. Is my model matrix correct, or do I have a problem with the scale of the 
interaction variables?
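
  One quick check I have done is to look at the column means and standard 
deviations of the model matrix, to see which columns are not centered/scaled 
(just a diagnostic, nothing more):

col_means <- colMeans(mm)
col_sds   <- apply(mm, 2, sd)
round(cbind(mean=col_means, sd=col_sds), 3)   # squares and interactions won't be mean 0 / sd 1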

  Having run glmnet(), I now have lots of coefficients for the different values 
of lambda.  One thing that seems problematic is that the coefficient for an 
interaction (Cx:Cy, say), will become non-zero while the coefficients for the 
underlying terms Cx and Cy are still zero.  Of course glmnet() has no way of 
preventing this, since it doesn't know the formula or which matrix columns are 
interactions, etc.  From my previous work on model selection via step-down, my 
understanding is that this state of affairs should not be allowed.  Using 
step-down, Cx and Cy will not be dropped as long as Cx:Cy is retained, for 
example.  As I understand it, this is for essentially the same conceptual 
reasons that one ought to retain the intercept in a regression even if it is 
non-significant; the slope of the regression depends upon it, and so dropping 
it is not correct.  However, I don't see any way that I can tell glmnet() about 
these relationships, since it doesn't have a formula interface.  So:

4. Is it a problem that the lasso fit gives non-zero coefficients for 
interactions whose underlying terms have zero coefficients?
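
  To see how often this happens, I have been checking the coefficients at a 
given lambda roughly like this (a sketch; s=0.01 is just an example value, and 
the grep assumes interaction columns are named like "Cx:Cy" by model.matrix()):

b <- as.matrix(coef(lasso, s=0.01))[-1, 1]       # coefficients at one lambda (intercept dropped)
nonzero <- names(b)[b != 0]
for (term in grep(":", nonzero, value=TRUE)) {   # each non-zero interaction
    parents <- strsplit(term, ":", fixed=TRUE)[[1]]
    absent <- parents[!(parents %in% nonzero)]   # component terms with zero coefficients
    if (length(absent) > 0)
        cat(term, "is non-zero, but", paste(absent, collapse=", "), "is zero\n")
}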

  And finally comes the actual model selection problem.  The best lambda value 
given by cv.glmnet() is for a model that contains most of the 116 terms; I 
suppose that, as with the BIC step-down, this is because 300,000 rows is a lot 
of data, and so most of those terms are in fact well-supported by the data.  
But I want a simple explanatory model with only the terms of large effect, as 
described before.  I'm thinking of finding the prediction rate for each lambda 
value used by glmnet, and using the same subjective "simplest model with a 
pretty good prediction rate" criterion that I was using before.  But I'm 
wondering:

5. Is there any way to choose a simple explanatory model, smaller than the best 
predictive model supported by the data, that is less arbitrary / subjective?
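
  What I mean by the prediction rate for each lambda is roughly this (a sketch; 
it reuses the training data, which I realize is optimistic, and it assumes 
outcome is coded 0/1):

pred <- predict(lasso, newx=mm, type="class")   # one column of predicted classes per lambda
accuracy <- colMeans(pred == df$outcome)        # correct prediction rate at each lambda
nterms <- lasso$df                              # number of non-zero terms at each lambda
cbind(lambda=lasso$lambda, nterms, accuracy)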

  And finally, I've got sort of a philosophical question that I'm struggling 
with that makes me wonder if the lasso is really the right tool for my job in 
the first place.  As lambda decreases, the coefficients of terms diverge from 
zero, one by one, until all terms have non-zero coefficients and you've arrived 
at the full model.  The funny thing is, for most values of lambda, some term or 
another in my formula will have only just recently diverged from zero, and 
so that term will have a very small coefficient -- the shrinkage aspect of the 
lasso will have shrunk it almost, but not quite, to zero.  This is weird, given 
that my goal is a simple explanatory model; having terms with very small but 
non-zero coefficients is contrary to the goals of my statistical analysis.  The 
lasso is considered to be better for producing explanatory models than some 
other methods, because coefficients do eventually go to zero, instead of just 
getting smaller and smaller; but still, I wonder if I ought to be using some 
more extreme method of model selection that, like the
the step-down I was using before, either keeps a term in or drops it, with no 
in-between.  On the other hand, I gather that the shrinkage that the lasso does 
is considered to be a good thing.  I don't really understand enough about all 
this to know what to do.  So:

6. Should I be looking at a method other than the lasso, so that I would not 
get some terms with very small coefficients?  Or should I use the lasso, and 
just ignore, in my explanation of the model, terms with tiny coefficients?  Or 
is there some other way of handling this problem?
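
  For what it's worth, the way I have been looking at the tiny-coefficient 
issue is just to sort the non-zero coefficients by magnitude at whatever lambda 
I am considering (again a sketch; s=0.01 is just an example value):

b <- as.matrix(coef(lasso, s=0.01))[-1, 1]
sort(abs(b[b != 0]), decreasing=TRUE)    # non-zero coefficients, largest effects first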

  Thanks to anybody who has even read this far, and many many thanks to any 
responses!

Ben Haller
McGill University

http://biology.mcgill.ca/grad/ben/
