Hello, I'm very new to R, so my apologies if I'm making an obvious mistake. I have a data frame with ~170k rows and 14 numeric variables. The first two of those variables (call them group1 and group2) define groups: each unique pair (group1, group2) is a group. There are roughly 50k such unique groups, with sizes ranging from 1 to 40 rows each.
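Following the posting guide, here is a small simulation of data with this shape, so the problem is reproducible (the names dataset, group1, group2, ID, and A-E are taken from my description below; the values themselves are made up, and unlike my real data this sketch does not force ID to be unique within each group):

```r
set.seed(1)
n <- 1000  # use ~170000 for the real scale
dataset <- data.frame(
  group1 = sample(1:50, n, replace = TRUE),
  group2 = sample(1:50, n, replace = TRUE),
  ID     = sample(1:40, n, replace = TRUE),  # row index within a group
  A = rnorm(n), B = rnorm(n), C = rnorm(n),
  D = rnorm(n), E = rnorm(n)
)
# number of distinct (group1, group2) groups
nrow(unique(dataset[, c("group1", "group2")]))
```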
My objective is to fit a linear regression within each group and obtain its mean squared error (MSE), so the final output is a collection of ~50k MSEs. Regardless of the size of the group, the regression needs to be run on exactly 40 observations: if a group has fewer than 40 rows, I pad it to 40, filling all variables with 0 in the extra rows. Here's the function I wrote to do this:

get_MSE = function(x) {
  rownames(x) = x$ID             # 'ID' can take on any value from 1 to 40
  x = x[as.character(1:40), ]    # pad to 40 rows; missing IDs become all-NA rows
  x[is.na(x)] = 0                # zero-fill the padded rows
  regressionResult = lm(A ~ B + C + D + E, data = x)  # A-E are variables in the data frame
  MSE = mean((regressionResult$fitted.values - x$A)^2)
  return(MSE)
}

library(plyr)
output = ddply(dataset, list(dataset$group1, dataset$group2), get_MSE)

The above code takes about 10 minutes to run, but I'd really need it to be much faster, if at all possible. Is there anything I can do to speed it up?

Thank you very much in advance.

Jose

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
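To make the padding step above concrete, here is how the rowname-indexing trick in get_MSE behaves on a toy group (toy values, not my real data):

```r
# A group with only 2 of the 40 possible IDs present
g <- data.frame(ID = c(3, 17), A = c(1.2, -0.5), B = c(0.4, 2.1))
rownames(g) <- g$ID
g <- g[as.character(1:40), ]  # IDs with no matching row come back as all-NA rows
g[is.na(g)] <- 0              # ...which are then zero-filled

nrow(g)        # 40
sum(g$A != 0)  # 2 (only the original rows carry data)
```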