What you are doing looks reasonably efficient. I believe the object returned by lm() has a residuals element, so you could use that directly instead of subtracting the response from the fitted values yourself (which will be just a little more efficient).
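A minimal sketch of that point, on made-up toy data (the column names A through E follow the original post; the values and the names mse_fitted and mse_resid are only for illustration). It writes x$A rather than a bare A, since A would otherwise have to exist as a variable outside the data frame:

```r
# Toy data standing in for one padded group of 40 rows (values are invented).
set.seed(1)
x <- data.frame(A = rnorm(40), B = rnorm(40), C = rnorm(40),
                D = rnorm(40), E = rnorm(40))

fit <- lm(A ~ B + C + D + E, data = x)

# MSE computed the way the original function does it (fitted minus response).
mse_fitted <- mean((fit$fitted.values - x$A)^2)

# Same quantity taken straight from the residuals element of the fit.
mse_resid <- mean(fit$residuals^2)
```

Both expressions give the same number up to floating-point error; the second just skips one subtraction.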
The bulk of the time is probably being taken up in the lm() call, which has a lot of overhead: you could use fastLm from the RcppArmadillo package, or call lm.fit() directly, to cut a lot of this out.

Michael

On Wed, Feb 22, 2012 at 9:10 PM, Martin <misen...@gmail.com> wrote:
> Hello,
> I'm very new to R so my apologies if I'm making an obvious mistake.
>
> I have a data frame with ~170k rows and 14 numeric variables. The first 2
> of those variables (let's call them group1 and group2) are used to define
> groups: each unique pair of (group1,group2) is a group. There are roughly
> 50k such unique groups, with sizes varying from 1 through 40 rows each.
>
> My objective is to fit a linear regression within each group and get its
> mean square error (MSE). So the final output needs to be a collection of
> 50k MSE's. Now, regardless of the size of the group, the regression needs
> to be run on exactly 40 observations. If the group has less than 40
> observations, then I need to add rows to get to 40, populating all
> variables with 0's for those extra rows. Here's the function I wrote to do
> this:
>
> get_MSE = function(x) {
>   rownames(x) = x$ID #'ID' can take on any value from 1 to 40.
>   x = x[as.character(1:40), ]
>   x[is.na(x)] = 0
>   regressionResult = lm(A ~ B + C + D + E, data=x) #A-E are some variables in the data frame.
>   MSE = mean((regressionResult$fitted.values - A)^2)
>   return(MSE)
> }
>
> library(plyr)
> output = ddply(dataset, list(dataset$group1, dataset$group2), get_MSE)
>
> The above code takes about 10 minutes to run, but I'd really need it to be
> much faster, if at all possible. Is there anything I can do to speed up the
> code?
>
> Thank you very much in advance.
>
> Jose

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
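
For reference, the lm.fit() suggestion in the reply could look roughly like the sketch below. The helper name get_mse_fit and the toy group g are hypothetical and not from the thread; the columns ID and A through E are assumed to exist as described in the original post. The padding is done with match() on ID rather than through row names, which is a substitution made here for clarity, not the poster's method.

```r
# Hypothetical helper: MSE for one group using lm.fit() on a prebuilt
# design matrix, which avoids lm()'s formula and model-frame handling.
get_mse_fit <- function(x) {
  idx <- match(1:40, x$ID)                       # NA where an ID is absent
  y <- x$A[idx]
  X <- as.matrix(x[idx, c("B", "C", "D", "E")])
  y[is.na(y)] <- 0                               # pad missing rows with zeros
  X[is.na(X)] <- 0
  fit <- lm.fit(cbind(Intercept = 1, X), y)      # intercept column added by hand
  mean(fit$residuals^2)
}

# Toy group with 25 of the 40 possible IDs (values are invented).
set.seed(1)
g <- data.frame(ID = 1:25, A = rnorm(25), B = rnorm(25), C = rnorm(25),
                D = rnorm(25), E = rnorm(25))
mse_g <- get_mse_fit(g)
```

A function like this could be passed to ddply() in place of get_MSE; how much time it saves on the full data set would need to be measured.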