For performance reasons, I advise using the following function instead of model.matrix:

factorsToDummyVariables<-function(dfr, betweenColAndLevel="")
{
  nc<-dim(dfr)[2]
  firstRow<-dfr[1,]
  coln<-colnames(dfr)
  retval<-do.call(cbind, lapply(seq(nc), function(ci){
    if(is.factor(firstRow[,ci]))
    {
      lvls<-levels(firstRow[,ci])[-1]
      stretchedcols<-sapply(lvls, function(lvl){
        rv<-dfr[,ci]==lvl
        mode(rv)<-"integer"
        return(rv)
      })
      if(!is.matrix(stretchedcols)) stretchedcols<-matrix(stretchedcols, nrow=1)
      colnames(stretchedcols)<-paste(coln[ci], lvls, sep=betweenColAndLevel)
      return(stretchedcols)
    }
    else
    {
      curcol<-matrix(dfr[,ci], ncol=1)
      colnames(curcol)<-coln[ci]
      return(curcol)
    }
  }))
  rownames(retval)<-rownames(dfr)
  return(retval)
}

Just for comparison: here is my old version of the same function, using model.matrix:

factorsToDummyVariables.old<-function(dfrPredictors,
  form=paste("~",paste(colnames(dfrPredictors), collapse="+"), sep=""))
{
  #note: this function seems to operate quite slowly!
  #Because it is used often, it may be worth improving its speed
  dfrTmp<-model.frame(dfrPredictors, na.action=na.pass)
  frm<-as.formula(form)
  mm<-model.matrix(frm, data=dfrTmp)
  retval<-as.matrix(mm)[,-1]
  return(retval)
}

In a testcase with a reasonably big dataset, I compared the speeds:

#system.time(tmp.fd.convds.full.man<-manualFactorsToDummyVariables(ds))
##   user  system elapsed
##   9.44    0.00    9.48
#system.time(tmp.fd.convds.full<-factorsToDummyVariables.old(ds))
##   user  system elapsed
##  15.49    0.00   15.64
#system.time(invisible(factorsToDummyVariables (ds[10,])))
##   user  system elapsed
##   0.36    0.00    0.36
#system.time(invisible(factorsToDummyVariables.old (ds[10,])))
##   user  system elapsed
##   2.18    0.00    2.20
#system.time(invisible(factorsToDummyVariables (ds[20:30,])))
##   user  system elapsed
##   0.34    0.00    0.38
#system.time(invisible(factorsToDummyVariables.old (ds[20:30,])))
##   user  system elapsed
##   2.11    0.00    2.15

If you have to do this quite often, the difference surely adds up... More improvements may be possible.
This function only works if you don't include interactions, though.
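
As a quick sanity check, the sketch below compares the hand-rolled function with model.matrix on a small data frame. The data frame X here is made up purely for illustration (it is not the original poster's data, nor my benchmark dataset), and the names manual, viaMM, oneRow and hasInteractionCols are just local helpers. The function body is repeated only so that the snippet runs on its own.

```r
# Same function as above, repeated so this snippet is self-contained.
factorsToDummyVariables <- function(dfr, betweenColAndLevel = "") {
  nc <- dim(dfr)[2]
  firstRow <- dfr[1, ]
  coln <- colnames(dfr)
  retval <- do.call(cbind, lapply(seq(nc), function(ci) {
    if (is.factor(firstRow[, ci])) {
      # One 0/1 column per level, skipping the first (reference) level.
      lvls <- levels(firstRow[, ci])[-1]
      stretchedcols <- sapply(lvls, function(lvl) {
        rv <- dfr[, ci] == lvl
        mode(rv) <- "integer"
        rv
      })
      # With a single input row, sapply returns a vector, not a matrix.
      if (!is.matrix(stretchedcols)) stretchedcols <- matrix(stretchedcols, nrow = 1)
      colnames(stretchedcols) <- paste(coln[ci], lvls, sep = betweenColAndLevel)
      stretchedcols
    } else {
      # Non-factor columns are passed through unchanged.
      curcol <- matrix(dfr[, ci], ncol = 1)
      colnames(curcol) <- coln[ci]
      curcol
    }
  }))
  rownames(retval) <- rownames(dfr)
  retval
}

# Hypothetical toy data: two factors, one integer and one numeric column.
X <- data.frame(
  X1 = factor(c("D", "A", "C", "A", "B", "B")),
  X2 = factor(c("G", "H", "G", "F", "I", "E")),
  X3 = c(51L, 46L, 50L, 44L, 43L, 50L),
  X4 = c(2.86, 1.55, 1.94, 2.45, 2.75, 1.62)
)

manual <- factorsToDummyVariables(X)
viaMM  <- model.matrix(~ X1 + X2 + X3 + X4, X)[, -1]  # drop the intercept

# Same values and same column names (with the default treatment contrasts).
print(isTRUE(all.equal(unname(manual), unname(viaMM))))
print(identical(colnames(manual), colnames(viaMM)))

# A single row still yields a 1-row matrix with all dummy columns.
oneRow <- factorsToDummyVariables(X[3, ])
print(oneRow)

# Interaction terms produce extra "a:b" columns in model.matrix,
# which the hand-rolled function does not generate.
hasInteractionCols <- any(grepl(":", colnames(model.matrix(~ X1 * X3, X)), fixed = TRUE))
print(hasInteractionCols)
```

The comparison assumes the default contr.treatment contrasts; with other contrast settings model.matrix will code the factors differently and the two results will no longer agree.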

Nick Sabbe
--
ping: nick.sa...@ugent.be
link: http://biomath.ugent.be
wink: A1.056, Coupure Links 653, 9000 Gent
ring: 09/264.59.36

-- Do Not Disapprove

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of David Winsemius
Sent: maandag 2 mei 2011 20:48
To: Steve Lianoglou
Cc: r-help@r-project.org
Subject: Re: [R] Lasso with Categorical Variables

On May 2, 2011, at 10:51 AM, Steve Lianoglou wrote:

> Hi,
>
> On Mon, May 2, 2011 at 12:45 PM, Clemontina Alexander <ckale...@ncsu.edu> wrote:
>> Hi! This is my first time posting. I've read the general rules and
>> guidelines, but please bear with me if I make some fatal error in
>> posting. Anyway, I have a continuous response and 29 predictors made
>> up of continuous variables and nominal and ordinal categorical
>> variables. I'd like to do lasso on these, but I get an error. The way
>> I am using "lars" doesn't allow for the factors. Is there a special
>> option or some other method in order to do lasso with cat. variables?
>>
>> Here is and example (considering ordinal variables as just nominal):
>>
>> set.seed(1)
>> Y <- rnorm(10,0,1)
>> X1 <- factor(sample(x=LETTERS[1:4], size=10, replace = TRUE))
>> X2 <- factor(sample(x=LETTERS[5:10], size=10, replace = TRUE))
>> X3 <- sample(x=30:55, size=10, replace=TRUE) # think age
>> X4 <- rchisq(10, df=4, ncp=0)
>> X <- data.frame(X1,X2,X3,X4)
>>
>>> str(X)
>> 'data.frame': 10 obs. of 4 variables:
>> $ X1: Factor w/ 4 levels "A","B","C","D": 4 1 3 1 2 2 1 2 4 2
>> $ X2: Factor w/ 5 levels "E","F","G","H",..: 3 4 3 2 5 5 5 1 5 3
>> $ X3: int 51 46 50 44 43 50 30 42 49 48
>> $ X4: num 2.86 1.55 1.94 2.45 2.75 ...
>>
>>
>> I'd like to do:
>> obj <- lars(x=X, y=Y, type = "lasso")
>>
>> Instead, what I have been doing is converting all data to continuous
>> but I think this is really bad!
>
> Yeah, it is.
>
> Check out the "Categorical Predictor Variables" section here for a way
> to handle such predictor vars:
> http://www.psychstat.missouristate.edu/multibook/mlt08m.html

Steve's citation is somewhat helpful, but not sufficient to take the
next steps. You can find details regarding the mechanics of typical
linear regression in R on the ?lm page where you find that the factor
variables are typically handled by model.matrix. See below:

> model.matrix(~X1 + X2 + X3 + X4, X)
   (Intercept) X1B X1C X1D X2F X2G X2H X2I X3        X4
1            1   0   0   1   0   1   0   0 51 2.8640884
2            1   0   0   0   0   0   1   0 46 1.5462243
3            1   0   1   0   0   1   0   0 50 1.9430901
4            1   0   0   0   1   0   0   0 44 2.4504180
5            1   1   0   0   0   0   0   1 43 2.7535052
6            1   1   0   0   0   0   0   1 50 1.6200326
7            1   0   0   0   0   0   0   1 30 0.5750533
8            1   1   0   0   0   0   0   0 42 5.9224777
9            1   0   0   1   0   0   0   1 49 2.0401528
10           1   1   0   0   0   1   0   0 48 6.2995288
attr(,"assign")
[1] 0 1 1 1 2 2 2 2 3 4
attr(,"contrasts")
attr(,"contrasts")$X1
[1] "contr.treatment"

attr(,"contrasts")$X2
[1] "contr.treatment"

The numeric variables are passed through, while the dummy variables
for factor columns are constructed (as treatment contrasts) and the
whole thing is returned in a neat package.

--
David.

>
> HTH,
> -steve
> --

David Winsemius, MD
Heritage Laboratories
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.