Hi:

There are some advantages to taking a plyr approach to this type of problem.
The basic idea is to fit a linear model to each subgroup and save the
results in a list, from which you can extract what you want piece by piece.

library(plyr)

# One of those SAS style data sets...
> df <- data.frame(matrix(scan(), ncol = 3, byrow = TRUE))
1: 76 36476 15.8  76 36493 66.9  76 36579 65.6  111 35465 10.3  111 35756 4.8
16: 121 38183 16  121 38184 15  121 38254 9.6  121 38255 7  168 37727 21.9  168
32: 37739 29.7  168 37746 97.4
37:
Read 36 items

# A little cleanup:
names(df) <- c('ID', 'x', 'y')
df$ID <- factor(df$ID)

# Fit a linear model to each sub-data frame identified by ID
# and send the results to a list object

# dlply takes a data frame as input and outputs a list
# the grouping variable is ID
# the argument d in the function is the sub-data frame of a given ID
lr1 <- dlply(df, .(ID), function(d) lm(y ~ x, data = d))
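
For comparison, the same split-and-fit step can be written without plyr.
This is only a sketch of a base R equivalent, not something from the
session above: the name lr_base is made up for illustration, and the data
are typed in directly rather than read with scan().

```r
# Same 12 observations as above, entered directly instead of through scan()
df <- data.frame(
  ID = factor(c(76, 76, 76, 111, 111, 121, 121, 121, 121, 168, 168, 168)),
  x  = c(36476, 36493, 36579, 35465, 35756, 38183, 38184, 38254, 38255,
         37727, 37739, 37746),
  y  = c(15.8, 66.9, 65.6, 10.3, 4.8, 16, 15, 9.6, 7, 21.9, 29.7, 97.4)
)

# split() cuts the data frame into one piece per level of ID;
# lapply() then fits the same model to each piece.
# lr_base is a hypothetical name, chosen to avoid clashing with lr1.
lr_base <- lapply(split(df, df$ID), function(d) lm(y ~ x, data = d))

# The result is a named list with one fitted model per ID
names(lr_base)
coef(lr_base[["111"]])
```

The list it returns has the same shape as lr1 (one lm object per ID), so
lapply() and sapply() can play the role of llply() and ldply() below.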

# So you can do things like:

# Grab the model coefficients
# (input is a list, output is a data frame)
> ldply(lr1, function(m) coef(m))
   ID  (Intercept)           x
1  76  -11699.9999  0.32176123
2 111     680.6007 -0.01890034
3 121    3900.5051 -0.10174534
4 168 -136322.4296  3.61371841

# Extract the R^2 values
# (ID 111 has only two points, so its line fits exactly and R^2 = 1)
> ldply(lr1, function(m) summary(m)$r.squared)
   ID        V1
1  76 0.3718840
2 111 1.0000000
3 121 0.9367437
4 168 0.6993811
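
With around 20000 IDs it may be handier to have everything in one table.
The following is a rough sketch in base R, assuming the same 12 rows as
above; the names fits, one_row and summ are invented for this example.
The n column is there because groups with very few points (such as ID 111)
give fits that should not be trusted.

```r
# Same 12 observations as above, entered directly
df <- data.frame(
  ID = factor(c(76, 76, 76, 111, 111, 121, 121, 121, 121, 168, 168, 168)),
  x  = c(36476, 36493, 36579, 35465, 35756, 38183, 38184, 38254, 38255,
         37727, 37739, 37746),
  y  = c(15.8, 66.9, 65.6, 10.3, 4.8, 16, 15, 9.6, 7, 21.9, 29.7, 97.4)
)

# One fitted model per ID
fits <- lapply(split(df, df$ID), function(d) lm(y ~ x, data = d))

# Build a one-row data frame for a single model (hypothetical helper)
one_row <- function(id) {
  m <- fits[[id]]
  data.frame(ID        = id,
             n         = nobs(m),               # observations used in the fit
             intercept = coef(m)[[1]],
             slope     = coef(m)[[2]],
             r.squared = summary(m)$r.squared)
}

# Stack the rows into a single summary table, one row per ID
summ <- do.call(rbind, lapply(names(fits), one_row))
summ
```

Rows with a small n can then be filtered out before reading anything into
the slopes or the R^2 values.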

# Extract the residuals and fitted values to another list
# (column 1 = residuals, column 2 = fitted values)
> llply(lr1, function(m) cbind(resid(m), fitted(m)))
$`76`
        [,1]     [,2]
1 -20.762884 36.56288
2  24.867175 42.03282
3  -4.104291 69.70429

$`111`
  [,1] [,2]
4    0 10.3
5    0  4.8

$`121`
        [,1]      [,2]
6  0.4371678 15.562832
7 -0.4610869 15.461087
8  1.2610869  8.338913
9 -1.2371678  8.237168

$`168`
        [,1]     [,2]
10   9.57509 12.32491
11 -25.98953 55.68953
12  16.41444 80.98556

# Plot the residuals vs. fitted values for each model (don't blink :)
# the _ means that no object is returned; the plot is a side effect
# (each element of lr1 is a fitted model, hence the argument name m)
l_ply(lr1, function(m) plot(resid(m) ~ fitted(m)))
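
Since the plots above flash by on screen, one option is to send them to a
multi-page PDF and page through it afterwards. This is only a sketch: the
file name resid_plots.pdf is made up, and the models are refit in base R
so that the snippet runs on its own.

```r
# Same 12 observations as above, entered directly
df <- data.frame(
  ID = factor(c(76, 76, 76, 111, 111, 121, 121, 121, 121, 168, 168, 168)),
  x  = c(36476, 36493, 36579, 35465, 35756, 38183, 38184, 38254, 38255,
         37727, 37739, 37746),
  y  = c(15.8, 66.9, 65.6, 10.3, 4.8, 16, 15, 9.6, 7, 21.9, 29.7, 97.4)
)
fits <- lapply(split(df, df$ID), function(d) lm(y ~ x, data = d))

# Hypothetical output file in the session's temporary directory
out <- file.path(tempdir(), "resid_plots.pdf")

pdf(out)                         # open the PDF device; each plot() adds a page
for (id in names(fits)) {
  m <- fits[[id]]
  plot(fitted(m), resid(m),
       main = paste("ID", id),   # label the page with its group
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)         # reference line at zero residual
}
dev.off()                        # close the device so the file is written
```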

These are just some examples; clearly, there is a lot more one could do with
this type of structure.

HTH,
Dennis

On Tue, Dec 28, 2010 at 6:23 PM, Entropi ntrp <entropy...@gmail.com> wrote:

> Hi,
> I have been examining large data and need to do simple linear regression
> with the data which is grouped based on the values of a particular
> attribute. For instance, consider three columns : ID, x, y,  and  I need to
> regress x on y for each distinct value of ID. Specifically, for the set of
> data corresponding to each of the 4 values of ID (76,111,121,168) in the
> below data, I should invoke linear regression 4 times. The challenge is
> that, the length of the ID vector is around 20000 and therefore linear
> regression must be done automatically for each distinct value of ID.
>
>               ID            x                     y
>  76 36476 15.8  76 36493 66.9  76 36579 65.6  111 35465 10.3  111 35756 4.8
> 121 38183 16  121 38184 15  121 38254 9.6  121 38255 7  168 37727 21.9  168
> 37739 29.7  168 37746 97.4
> I was wondering whether there is an easy way to group data based on the
> values of ID in R  so that linear regression can be done easily for each
> group determined by each value of ID. Or, is the only way to construct
> loops  with 'for' or 'while'  in which a matrix is generated for each
> distinct value of ID  that stores corresponding values of x and y by
> screening the entire ID vector?
>
> Thanks in advance,
>
> Yasin
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
