Hi: There are some advantages to taking a plyr approach to this type of problem. The basic idea is to fit a linear model to each subgroup and save the results in a list, from which you can extract what you want piece by piece.
library(plyr)

# One of those SAS style data sets...
> df <- data.frame(matrix(scan(), ncol = 3, byrow = TRUE))
1: 76 36476 15.8 76 36493 66.9 76 36579 65.6 111 35465 10.3 111 35756 4.8
16: 121 38183 16 121 38184 15 121 38254 9.6 121 38255 7 168 37727 21.9 168
32: 37739 29.7 168 37746 97.4
37:
Read 36 items

# A little cleanup:
names(df) <- c('ID', 'x', 'y')
df$ID <- factor(df$ID)

# Fit a linear model to each sub-data frame identified by ID
# and send the results to a list object
# dlply takes a data frame as input and outputs a list
# the grouping variable is ID
# the argument d in the function is the sub-data frame of a given ID
lr1 <- dlply(df, .(ID), function(d) lm(y ~ x, data = d))

# So you can do things like:

# Grab the model coefficients
# (input is a list, output is a data frame)
> ldply(lr1, function(m) m$coef)
   ID  (Intercept)           x
1  76  -11699.9999  0.32176123
2 111     680.6007 -0.01890034
3 121    3900.5051 -0.10174534
4 168 -136322.4296  3.61371841

# export the R^2 values
> ldply(lr1, function(m) summary(m)$r.squared)
   ID        V1
1  76 0.3718840
2 111 1.0000000
3 121 0.9367437
4 168 0.6993811

# Extract the residuals and predicted values to another list
> llply(lr1, function(m) cbind(m$resid, m$fitted))
$`76`
        [,1]     [,2]
1 -20.762884 36.56288
2  24.867175 42.03282
3  -4.104291 69.70429

$`111`
  [,1] [,2]
4    0 10.3
5    0  4.8

$`121`
        [,1]      [,2]
6  0.4371678 15.562832
7 -0.4610869 15.461087
8  1.2610869  8.338913
9 -1.2371678  8.237168

$`168`
        [,1]     [,2]
10   9.57509 12.32491
11 -25.98953 55.68953
12  16.41444 80.98556

# Plot the residuals vs. fitted values for each model (don't blink :)
# the _ means that no object is returned; the plot is a side effect
l_ply(lr1, function(d) plot(resid(d) ~ fitted(d)))

These are just some examples; clearly, there is a lot more one could do with this type of structure.
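For comparison, the same split / fit / extract pattern can be sketched in base R, without plyr. This is only a sketch under a few assumptions: the data frame is retyped by hand from the values above instead of being read with scan(), and the object names fits and res are made up for illustration.

```r
# Same twelve observations as above, typed in directly
df <- data.frame(
  ID = factor(rep(c(76, 111, 121, 168), times = c(3, 2, 4, 3))),
  x  = c(36476, 36493, 36579, 35465, 35756,
         38183, 38184, 38254, 38255, 37727, 37739, 37746),
  y  = c(15.8, 66.9, 65.6, 10.3, 4.8,
         16, 15, 9.6, 7, 21.9, 29.7, 97.4)
)

# split() builds one sub-data frame per ID; lapply() fits a model to each.
# The result is a named list of lm objects, much like the dlply() output.
fits <- lapply(split(df, df$ID), function(d) lm(y ~ x, data = d))

# Collect intercept, slope and R^2 into one data frame, one row per ID
res <- do.call(rbind, lapply(names(fits), function(id) {
  m <- fits[[id]]
  data.frame(ID        = id,
             intercept = unname(coef(m)[1]),
             slope     = unname(coef(m)[2]),
             r.squared = summary(m)$r.squared)
}))
res
```

The list fits can be passed to lapply() or sapply() in the same way lr1 is passed to ldply() and llply() above; plyr mainly saves the do.call(rbind, ...) step and keeps track of the ID labels for you.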
HTH,
Dennis

On Tue, Dec 28, 2010 at 6:23 PM, Entropi ntrp <entropy...@gmail.com> wrote:

> Hi,
> I have been examining large data and need to do simple linear regression
> with the data which is grouped based on the values of a particular
> attribute. For instance, consider three columns : ID, x, y, and I need to
> regress x on y for each distinct value of ID. Specifically, for the set of
> data corresponding to each of the 4 values of ID (76,111,121,168) in the
> below data, I should invoke linear regression 4 times. The challenge is
> that, the length of the ID vector is around 20000 and therefore linear
> regression must be done automatically for each distinct value of ID.
>
> ID x y
> 76 36476 15.8 76 36493 66.9 76 36579 65.6 111 35465 10.3 111 35756 4.8
> 121 38183 16 121 38184 15 121 38254 9.6 121 38255 7 168 37727 21.9 168
> 37739 29.7 168 37746 97.4
>
> I was wondering whether there is an easy way to group data based on the
> values of ID in R so that linear regression can be done easily for each
> group determined by each value of ID. Or, is the only way to construct
> loops with 'for' or 'while' in which a matrix is generated for each
> distinct value of ID that stores corresponding values of x and y by
> screening the entire ID vector?
>
> Thanks in advance,
>
> Yasin
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.