On May 2, 2011, at 2:22 PM, Clemontina Alexander wrote:

Thanks for your response, but I guess I didn't make my question clear.
I am already familiar with the concept of dummy variables and
regression in R. My question is, can the "lars" package (or some other
lasso algorithm) handle factors?

The error message when you do so and the help page make it fairly clear that it does not.

I did use dummy variables in my
original data, but lars (lasso) only shrank the coefficients of some
of the levels of one factor to 0.

You certainly gave no evidence that would lead anyone to think that you did so. Please try to understand that just converting factors to 'numeric' is not the same as creating dummy variables.
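
To make that distinction concrete, here is a minimal sketch (the toy factor `f` and its level names are made up for illustration; they are not the poster's data). Coercing a factor to numeric returns its internal integer codes, whereas model.matrix() builds 0/1 indicator columns:

```r
# Toy factor; the level names are illustrative only.
f <- factor(c("A", "C", "B", "D", "A"))

# Coercion returns the internal integer codes, which imposes an
# arbitrary ordering and equal spacing on the levels (A=1, B=2, ...).
codes <- as.numeric(f)

# model.matrix() builds one 0/1 indicator column per non-reference
# level (treatment contrasts by default); drop the intercept column.
dummies <- model.matrix(~ f)[, -1]

print(codes)
print(dummies)
```

A single numeric column of codes forces one slope across all levels, while the indicator columns let each level have its own offset from the baseline.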

--
David.

Is this the correct thing to do?
Because intuitively it seems like I would want to shrink the whole
factor coefficient to 0. If this is correct, what is the
interpretation? For example, for X1, if lasso drops the coefficient
for levels A and B, but not C and D, does this mean that X1 should be
included in the model?
Thanks.
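
One way to look at the question above, offered as a rough sketch rather than a definitive answer: with treatment contrasts, a dummy coefficient of zero only says that level is not distinguished from the baseline level, so a factor still contributes to the model as long as any of its dummies is nonzero. The "assign" attribute of model.matrix() maps each column back to its term. The data and the coefficient vector `beta` below are hypothetical, invented purely for illustration:

```r
# Small made-up design: one 4-level factor and one numeric predictor.
X1 <- factor(c("A", "B", "C", "D", "A", "B", "C", "D"))
X3 <- c(51, 46, 50, 44, 43, 50, 30, 42)
mm <- model.matrix(~ X1 + X3)

# "assign" gives the term index of each column; drop the intercept's 0.
grp   <- attr(mm, "assign")[-1]
terms <- c("X1", "X3")[grp]

# Hypothetical lasso coefficients (NOT output from a real fit).
beta <- c(X1B = 0, X1C = 0.8, X1D = -0.3, X3 = 0)

# A term stays in the model if any of its columns has a nonzero coefficient.
kept <- tapply(beta != 0, terms, any)
print(kept)
```

In this made-up case X1 is kept, with level B effectively merged into the baseline A, while X3 is dropped. Note that which levels get merged depends on the choice of reference level.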



On Mon, May 2, 2011 at 2:47 PM, David Winsemius <dwinsem...@comcast.net > wrote:

On May 2, 2011, at 10:51 AM, Steve Lianoglou wrote:

Hi,

On Mon, May 2, 2011 at 12:45 PM, Clemontina Alexander <ckale...@ncsu.edu >
wrote:

Hi! This is my first time posting. I've read the general rules and
guidelines, but please bear with me if I make some fatal error in
posting. Anyway, I have a continuous response and 29 predictors made
up of continuous variables and nominal and ordinal categorical
variables. I'd like to do lasso on these, but I get an error. The way
I am using "lars" doesn't allow for the factors. Is there a special
option or some other method in order to do lasso with cat. variables?

Here is an example (considering ordinal variables as just nominal):

set.seed(1)
Y <- rnorm(10,0,1)
X1 <- factor(sample(x=LETTERS[1:4], size=10, replace = TRUE))
X2 <- factor(sample(x=LETTERS[5:10], size=10, replace = TRUE))
X3 <- sample(x=30:55, size=10, replace=TRUE)  # think age
X4 <- rchisq(10, df=4, ncp=0)
X <- data.frame(X1,X2,X3,X4)

str(X)

'data.frame':   10 obs. of  4 variables:
 $ X1: Factor w/ 4 levels "A","B","C","D": 4 1 3 1 2 2 1 2 4 2
 $ X2: Factor w/ 5 levels "E","F","G","H",..: 3 4 3 2 5 5 5 1 5 3
 $ X3: int  51 46 50 44 43 50 30 42 49 48
 $ X4: num  2.86 1.55 1.94 2.45 2.75 ...


I'd like to do:
obj <- lars(x=X, y=Y, type = "lasso")

Instead, what I have been doing is converting all data to continuous
but I think this is really bad!

Yeah, it is.

Check out the "Categorical Predictor Variables" section here for a way
to handle such predictor vars:
http://www.psychstat.missouristate.edu/multibook/mlt08m.html

Steve's citation is somewhat helpful, but not sufficient to take the next
steps. You can find details regarding the mechanics of typical linear
regression in R on the ?lm page where you find that the factor variables are
typically handled by model.matrix. See below:

model.matrix(~X1 + X2 + X3 + X4, X)
  (Intercept) X1B X1C X1D X2F X2G X2H X2I X3        X4
1            1   0   0   1   0   1   0   0 51 2.8640884
2            1   0   0   0   0   0   1   0 46 1.5462243
3            1   0   1   0   0   1   0   0 50 1.9430901
4            1   0   0   0   1   0   0   0 44 2.4504180
5            1   1   0   0   0   0   0   1 43 2.7535052
6            1   1   0   0   0   0   0   1 50 1.6200326
7            1   0   0   0   0   0   0   1 30 0.5750533
8            1   1   0   0   0   0   0   0 42 5.9224777
9            1   0   0   1   0   0   0   1 49 2.0401528
10           1   1   0   0   0   1   0   0 48 6.2995288
attr(,"assign")
 [1] 0 1 1 1 2 2 2 2 3 4
attr(,"contrasts")
attr(,"contrasts")$X1
[1] "contr.treatment"

attr(,"contrasts")$X2
[1] "contr.treatment"

The numeric variables are passed through, while the dummy variables for factor columns are constructed (as treatment contrasts) and the whole thing
is returned in a neat package.
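
As a minimal sketch of the next step (an assumption about how one might proceed, not something spelled out in this thread), the matrix from model.matrix() can be handed to lars() after dropping the intercept column, since lars() fits its own intercept by default. The data are rebuilt as in the original post; the sampled values may differ from those shown above on newer R versions, and the fit is only attempted if the lars package is installed:

```r
set.seed(1)
Y  <- rnorm(10, 0, 1)
X1 <- factor(sample(x = LETTERS[1:4],  size = 10, replace = TRUE))
X2 <- factor(sample(x = LETTERS[5:10], size = 10, replace = TRUE))
X3 <- sample(x = 30:55, size = 10, replace = TRUE)  # think age
X4 <- rchisq(10, df = 4, ncp = 0)
X  <- data.frame(X1, X2, X3, X4)

# Expand factors into treatment-contrast dummies and drop the
# "(Intercept)" column, leaving a plain numeric matrix.
mm <- model.matrix(~ X1 + X2 + X3 + X4, data = X)[, -1, drop = FALSE]

# Fit the lasso path only if the lars package is available.
if (requireNamespace("lars", quietly = TRUE)) {
  obj <- lars::lars(x = mm, y = Y, type = "lasso")
  print(obj)
}
```

Each dummy column is penalized as a separate predictor here, so the lasso can zero out some levels of a factor while keeping others.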

--
David.

HTH,
-steve

--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT




______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
