Hello, I'm very new to R, so my apologies if I'm making an obvious mistake. I have a data frame with ~170k rows and 14 numeric variables. The first two of those variables (call them group1 and group2) define groups: each unique pair (group1, group2) is a group. There are roughly 50k such unique groups, with sizes ranging from 1 to 40 rows each.
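Following the posting guide, here is a small simulation of data with this shape, so the problem is reproducible (the names dataset, group1, group2, ID, and A-E are taken from my description below; the values themselves are made up, and unlike my real data this sketch does not force ID to be unique within each group):

```r
set.seed(1)
n <- 1000  # use ~170000 for the real scale
dataset <- data.frame(
  group1 = sample(1:50, n, replace = TRUE),
  group2 = sample(1:50, n, replace = TRUE),
  ID     = sample(1:40, n, replace = TRUE),  # row index within a group
  A = rnorm(n), B = rnorm(n), C = rnorm(n),
  D = rnorm(n), E = rnorm(n)
)
# number of distinct (group1, group2) groups
nrow(unique(dataset[, c("group1", "group2")]))
```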
My objective is to fit a linear regression within each group and obtain its mean squared error (MSE), so the final output is a collection of ~50k MSEs. Regardless of the size of the group, the regression needs to be run on exactly 40 observations: if a group has fewer than 40 rows, I pad it to 40, filling all variables with 0 in the extra rows. Here's the function I wrote to do this:

get_MSE = function(x) {
  rownames(x) = x$ID             # 'ID' can take on any value from 1 to 40
  x = x[as.character(1:40), ]    # pad to 40 rows; missing IDs become all-NA rows
  x[is.na(x)] = 0                # zero-fill the padded rows
  regressionResult = lm(A ~ B + C + D + E, data = x)  # A-E are variables in the data frame
  MSE = mean((regressionResult$fitted.values - x$A)^2)
  return(MSE)
}

library(plyr)
output = ddply(dataset, list(dataset$group1, dataset$group2), get_MSE)

The above code takes about 10 minutes to run, but I'd really need it to be much faster, if at all possible. Is there anything I can do to speed it up?

Thank you very much in advance.

Jose

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
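To make the padding step above concrete, here is how the rowname-indexing trick in get_MSE behaves on a toy group (toy values, not my real data):

```r
# A group with only 2 of the 40 possible IDs present
g <- data.frame(ID = c(3, 17), A = c(1.2, -0.5), B = c(0.4, 2.1))
rownames(g) <- g$ID
g <- g[as.character(1:40), ]  # IDs with no matching row come back as all-NA rows
g[is.na(g)] <- 0              # ...which are then zero-filled

nrow(g)        # 40
sum(g$A != 0)  # 2 (only the original rows carry data)
```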