On Mon, May 02, 2011 at 05:22:57PM -0400, Clemontina Alexander wrote:
> Thanks for your response, but I guess I didn't make my question clear.
> I am already familiar with the concept of dummy variables and
> regression in R. My question is, can the "lars" package (or some other
> lasso algorithm) handle factors? I did use dummy variables in my
> original data, but lars (lasso) only shrank the coefficients of some
> of the levels of one factor to 0. Is this the correct thing to do?

It's because, so far as the linear model is concerned, factors are a
convenience to help us handle the dummy variables. So, yes, it's the
correct thing to do. It sounds to me as though you are after a
shrinkage device that will treat the factor as a whole.

> Because intuitively it seems like I would want to shrink the whole
> factor coefficient to 0. If this is correct, what is the
> interpretation? For example, for X1, if lasso drops the coefficient
> for levels A and B, but not C and D, does this mean that X1 should be
> included in the model?

It means that X1 should be recoded to be C, D, and the rest.

Cheers

Andrew

> Thanks.
>
>
>
> On Mon, May 2, 2011 at 2:47 PM, David Winsemius <dwinsem...@comcast.net>
> wrote:
> >
> > On May 2, 2011, at 10:51 AM, Steve Lianoglou wrote:
> >
> >> Hi,
> >>
> >> On Mon, May 2, 2011 at 12:45 PM, Clemontina Alexander <ckale...@ncsu.edu>
> >> wrote:
> >>>
> >>> Hi! This is my first time posting. I've read the general rules and
> >>> guidelines, but please bear with me if I make some fatal error in
> >>> posting. Anyway, I have a continuous response and 29 predictors made
> >>> up of continuous variables and nominal and ordinal categorical
> >>> variables. I'd like to do lasso on these, but I get an error. The way
> >>> I am using "lars" doesn't allow for the factors. Is there a special
> >>> option or some other method in order to do lasso with cat. variables?
> >>>
> >>> Here is an example (considering ordinal variables as just nominal):
> >>>
> >>> set.seed(1)
> >>> Y <- rnorm(10,0,1)
> >>> X1 <- factor(sample(x=LETTERS[1:4], size=10, replace = TRUE))
> >>> X2 <- factor(sample(x=LETTERS[5:10], size=10, replace = TRUE))
> >>> X3 <- sample(x=30:55, size=10, replace=TRUE)  # think age
> >>> X4 <- rchisq(10, df=4, ncp=0)
> >>> X <- data.frame(X1,X2,X3,X4)
> >>>
> >>>> str(X)
> >>>
> >>> 'data.frame':   10 obs. of  4 variables:
> >>>  $ X1: Factor w/ 4 levels "A","B","C","D": 4 1 3 1 2 2 1 2 4 2
> >>>  $ X2: Factor w/ 5 levels "E","F","G","H",..: 3 4 3 2 5 5 5 1 5 3
> >>>  $ X3: int  51 46 50 44 43 50 30 42 49 48
> >>>  $ X4: num  2.86 1.55 1.94 2.45 2.75 ...
> >>>
> >>>
> >>> I'd like to do:
> >>> obj <- lars(x=X, y=Y, type = "lasso")
> >>>
> >>> Instead, what I have been doing is converting all data to continuous
> >>> but I think this is really bad!
> >>
> >> Yeah, it is.
> >>
> >> Check out the "Categorical Predictor Variables" section here for a way
> >> to handle such predictor vars:
> >> http://www.psychstat.missouristate.edu/multibook/mlt08m.html
> >
> > Steve's citation is somewhat helpful, but not sufficient to take the next
> > steps. You can find details regarding the mechanics of typical linear
> > regression in R on the ?lm page where you find that the factor variables are
> > typically handled by model.matrix. See below:
> >
> >> model.matrix(~X1 + X2 + X3 + X4, X)
> >    (Intercept) X1B X1C X1D X2F X2G X2H X2I X3        X4
> > 1            1   0   0   1   0   1   0   0 51 2.8640884
> > 2            1   0   0   0   0   0   1   0 46 1.5462243
> > 3            1   0   1   0   0   1   0   0 50 1.9430901
> > 4            1   0   0   0   1   0   0   0 44 2.4504180
> > 5            1   1   0   0   0   0   0   1 43 2.7535052
> > 6            1   1   0   0   0   0   0   1 50 1.6200326
> > 7            1   0   0   0   0   0   0   1 30 0.5750533
> > 8            1   1   0   0   0   0   0   0 42 5.9224777
> > 9            1   0   0   1   0   0   0   1 49 2.0401528
> > 10           1   1   0   0   0   1   0   0 48 6.2995288
> > attr(,"assign")
> >  [1] 0 1 1 1 2 2 2 2 3 4
> > attr(,"contrasts")
> > attr(,"contrasts")$X1
> > [1] "contr.treatment"
> >
> > attr(,"contrasts")$X2
> > [1] "contr.treatment"
> >
> > The numeric variables are passed through, while the dummy variables for
> > factor columns are constructed (as treatment contrasts) and the whole thing
> > is returned in a neat package.
> >
> > --
> > David.
> >>
> >> HTH,
> >> -steve
> >>
> > --
> > David Winsemius, MD
> > Heritage Laboratories
> > West Hartford, CT
> >
> >
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Andrew Robinson
Program Manager, ACERA
Department of Mathematics and Statistics            Tel: +61-3-8344-6410
University of Melbourne, VIC 3010 Australia               (prefer email)
http://www.ms.unimelb.edu.au/~andrewpr              Fax: +61-3-8344-4599
http://www.acera.unimelb.edu.au/

Forest Analytics with R (Springer, 2011)
http://www.ms.unimelb.edu.au/FAwR/
Introduction to Scientific Programming and Simulation using R (CRC, 2009):
http://www.ms.unimelb.edu.au/spuRs/
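
For readers following the thread, the two ideas discussed in it (expanding
factors with model.matrix before calling lars, and shrinking a factor as a
whole) might be sketched roughly as follows. This is only an illustration
under assumptions that are not in the original posts: the sample size is
raised from 10 to 100 so the fits are less degenerate, the response is pure
noise as in the original example, and the grpreg package with its "grLasso"
penalty is swapped in as one possible group-wise shrinkage device (nobody in
the thread named a specific package for that). Each fit is skipped if the
corresponding package is not installed.

```r
# Simulated data with the same structure as the example in the thread.
# ASSUMPTION: n = 100 instead of the 10 rows used in the original post.
set.seed(1)
n  <- 100
Y  <- rnorm(n, 0, 1)
X1 <- factor(sample(LETTERS[1:4],  size = n, replace = TRUE))
X2 <- factor(sample(LETTERS[5:10], size = n, replace = TRUE))
X3 <- sample(30:55, size = n, replace = TRUE)   # think age
X4 <- rchisq(n, df = 4, ncp = 0)
X  <- data.frame(X1, X2, X3, X4)

# Expand the factors into treatment-contrast dummy columns.
mm         <- model.matrix(~ X1 + X2 + X3 + X4, data = X)
assign_idx <- attr(mm, "assign")   # 0 = intercept, k = k-th term in the formula

# Both fitting functions below handle the intercept themselves,
# so the intercept column is dropped from the design matrix.
keep   <- assign_idx != 0
design <- mm[, keep, drop = FALSE]
groups <- assign_idx[keep]         # one group id per remaining column

# Ordinary lasso: every dummy column is penalised on its own, so single
# levels of a factor can be shrunk to 0 while other levels stay in.
if (requireNamespace("lars", quietly = TRUE)) {
  fit_lasso <- lars::lars(x = design, y = Y, type = "lasso")
  print(head(fit_lasso$beta))      # coefficients along the first steps of the path
}

# Group lasso (ASSUMPTION: grpreg is one possible choice): the dummy columns
# of a factor share a group id and enter or leave the model together.
if (requireNamespace("grpreg", quietly = TRUE)) {
  fit_group <- grpreg::grpreg(design, Y, group = groups, penalty = "grLasso")
  print(dim(fit_group$beta))       # (intercept + columns) x number of lambda values
}
```

Taking the group ids from attr(mm, "assign") keeps the dummy columns of each
factor tied to the term that produced them, so no grouping has to be typed in
by hand when the formula changes.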