You can convert to a numeric matrix to make the operations faster: > > x=data.frame(id=rep(sample(1:100000,size=10000),2),a=sample(c(NA,rnorm(10,0,1)),size=20000, + replace=T),b=sample(c(NA,rnorm(10,0,1)),size=20000, + replace=T),c=sample(c(NA,rnorm(10,0,1)),size=20000, replace=T)) > x$id=factor(x$id) > # convert the numeric data for the 'dist' to a matrix for faster processing > x.m <- cbind(x$a, x$b, x$c) > # create a list of indices of each 'id' > x.i <- split(seq(nrow(x)), x$id) > system.time(x.d <- sapply(x.i, function(z) dist(x.m[z,]))) user system elapsed 1.56 0.00 1.67 > > > str(x.d) Named num [1:10000] 1.333 2.522 1.566 3.822 0.905 ... - attr(*, "names")= chr [1:10000] "15" "28" "37" "78" ... >
On Fri, Aug 1, 2008 at 2:45 PM, Eleni Rapsomaniki <[EMAIL PROTECTED]> wrote: > Dear R users, > > I need to come up with an efficient method to compute the correlation (or at > least, the euclidean distance if that's easier) between specific rows in a > data > frame (46,232 rows, 29 columns). The pairs of rows between which I want to > find the correlation share a common value in one of the columns. So for > example, > in the following > > x=data.frame(id=rep(sample(1:100000,size=10000),2),a=sample(c(NA,rnorm(10,0,1)),size=10000, > replace=T),b=sample(c(NA,rnorm(10,0,1)),size=10000, > replace=T),c=sample(c(NA,rnorm(10,0,1)),size=10000, replace=T)) > x$id=factor(x$id) > > I would want to compute the correlation between the two rows (for cols a,b,c) > that share the same > id. Using a for loop and dist() works but takes a long time (>1 hour, my RAM > is > 1Gb): > p=NULL > for(i in levels(x$id)){p[[i]]=dist(x[x$id==i, -1])} > > Is there a more efficient way? I thought about apply/sapply etc but I don't > think they'll work for rows and can't think of an intelligent way to make them > work! > The second problem is that I also need to know how many degrees of freedom (ie > non missing pairs of values) were used in each correlation. Is there a way to > also do this efficiently? > > I hope this makes sense! Thank you all very much in advance! > > Eleni > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.