On Feb 6, 2008 11:28 AM, Tony Plate <[EMAIL PROTECTED]> wrote:
> Bert Gunter wrote:
> > I strongly suggest you collaborate with a local statistician. I can think of
> > no circumstance where multiple regression on "hundreds of thousands of
> > variables" is anything more than a fancy random number generator.
>
> That sounds like a challenge! What is the largest regression problem (in
> terms of numbers of variables) that people have encountered where it made
> sense to do some sort of linear regression (and gave useful results)?
> (Including multilevel and Bayesian techniques.)

I have fit linear and generalized linear models with hundreds of
thousands of coefficients but, of course, with a highly structured
model matrix and using sparse matrix techniques. What is called the
Rasch model for analysis of item response data (e.g. correct/incorrect
responses by students to the items on a multiple-choice test) is a
generalized linear model with the students and the items as factors.

However, like Bert I would be very dubious of any attempt to fit a
linear regression model to 3000 variables that were not generated in a
systematic way. Sounds like a massive, computer-fueled fishing
expedition (a.k.a. "data mining").

> However, the original poster did say "hundreds to thousands", which is
> smaller than "hundreds of thousands". When I try a regression problem with
> 3,000 coefficients in R running under Windows XP 64 bit with 8Gb of memory
> on the machine and the /3Gb option active (i.e., R can get up to 3Gb), R
> 2.6.1 runs out of memory (apparently trying to duplicate the model matrix):
>
> R version 2.6.1 (2007-11-26)
> Copyright (C) 2007 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
>
> > m <- 3000
> > n <- m * 10
> > x <- matrix(rnorm(n*m), ncol=m, nrow=n,
>      dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
> > dim(x)
> [1] 30000  3000
> > k <- sample(m, 10)
> > y <- rowSums(x[,k]) + 10 * rnorm(n)
> > fit <- lm.fit(y=y, x=x)
> Error: cannot allocate vector of size 686.6 Mb
> > object.size(x)/2^20
> [1] 687.7787
> > memory.size()
> [1] -2022.552
> >
>
> and the Windows process monitor shows the peak memory usage for Rgui.exe at
> 2,137,923K. But in a 64 bit version of R, I would be surprised if it was
> not possible to run this (given sufficient memory).
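
The size in the quoted error message is consistent with one further copy
of the model matrix: 30000 x 3000 doubles at 8 bytes each. A minimal
sketch of that arithmetic follows. The second half is an assumption and
not something proposed in the thread: on a small simulated problem it
fits the same model from the cross-product via the normal equations,
which is generally less numerically stable than the QR decomposition
that lm.fit uses. The names n_small, m_small, beta_qr and beta_ne are
illustrative only.

```r
# Size of one copy of the 30000 x 3000 double-precision model matrix
n <- 30000
m <- 3000
bytes_per_double <- 8
copy_mb <- n * m * bytes_per_double / 2^20
round(copy_mb, 1)   # 686.6, the size reported in the error message

# Assumption (not from the thread): a small problem fitted two ways.
set.seed(1)
n_small <- 500
m_small <- 20
x <- matrix(rnorm(n_small * m_small), nrow = n_small, ncol = m_small)
y <- rowSums(x[, 1:5]) + rnorm(n_small)

# QR-based fit, the same call as in the quoted session
beta_qr <- unname(lm.fit(x = x, y = y)$coefficients)

# Normal equations: only an m_small x m_small cross-product is formed
xtx <- crossprod(x)      # t(x) %*% x
xty <- crossprod(x, y)   # t(x) %*% y
beta_ne <- as.vector(solve(xtx, xty))

# Largest disagreement between the two sets of coefficients
max(abs(beta_qr - beta_ne))
```

For the 3,000-column case the cross-product would be 3000 x 3000
doubles, a little under 69 Mb, although x itself still has to fit in
memory once. Whether that is enough to get under the 3Gb limit
described above has not been checked here.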
>
> However, R easily handles a slightly smaller problem:
> > m <- 1000 # of variables
> > n <- m * 10 # of rows
> > k <- sample(m, 10)
> > x <- matrix(rnorm(n*m), ncol=m, nrow=n,
>      dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
> > y <- rowSums(x[,k]) + 10 * rnorm(n)
> > fit <- lm.fit(y=y, x=x)
> > # distribution of coefs that should be one vs zero
> > round(rbind(one=quantile(fit$coefficients[k]),
>      zero=quantile(fit$coefficients[-k])), digits=2)
>         0%   25%   50%  75% 100%
> one   0.94  0.98  1.04 1.10 1.18
> zero -0.30 -0.08 -0.01 0.06 0.29
> >
>
> To echo Bert Gunter's cautions, one must be careful doing ordinary linear
> regression with large numbers of coefficients. It does seem a little
> unlikely that there is sufficient data to get useful estimates of three
> thousand coefficients using linear regression in data managed in Excel
> (though I guess it could be possible using Excel 12.0, which can handle up
> to 1 million rows - recent versions prior to 2008 could handle only 64K rows
> - see http://en.wikipedia.org/wiki/Microsoft_Excel#Versions ). So, the
> suggestion to consult a local statistician is good advice - there may be
> other more suitable approaches, and if some form of linear regression is an
> appropriate approach, there are things to do to gain confidence that the
> results of the linear regression convey useful information.
>
> -- Tony Plate
>
> >
> > -- Bert Gunter
> > Genentech Nonclinical Statistics
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
> > Behalf Of Michelle Chu
> > Sent: Tuesday, February 05, 2008 9:00 AM
> > To: R-help@r-project.org
> > Subject: [R] Maximum number of variables allowed in a multiple
> > linear regression model
> >
> > Hi,
> >
> > I appreciate it if someone can confirm the maximum number of variables
> > allowed in a multiple linear regression model. Currently, I am looking for
> > a software with the capacity of handling approximately 3,000 variables.
> > I am using Excel to process the results. Any information for processing a
> > matrix from Excel with hundreds to thousands of variables will be helpful.
> >
> > Best Regards,
> > Michelle
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.