Hi all,

SETUP: I have pairwise data on 22 chromosomes. The data matrix X for a given chromosome looks like this:

     1   13  58  1.12
     6  142  56  1.11
    18  307  64  3.13
    22  320  58  0.72

Column 1 is person ID 1, column 2 is person ID 2, column 3 can be ignored, and column 4 is how much chromosomal sharing those two individuals have in some small portion of the chromosome. There are 9000 individuals, and therefore ~(9000^2)/2 pairwise matches at each small location on the chromosome, so across an entire chromosome these matrices are VERY large (e.g., 3 billion rows, which is > the 2^31 vector size limitation in R). I have access to a server with 64-bit R, 1TB RAM and 80 processors.

PROBLEM: A pair of individuals (e.g., persons 1 and 13 from the first row above) will show up multiple times in a given file. I want to sum column 4 across each pair of individuals. If I could bring the matrix into R, I could use tapply() to accomplish this by indexing on paste(X[,1],X[,2]), but the matrix doesn't fit into R. I have been trying to use the bigmemory and bigtabulate packages in R, but when I try to use the bigsplit function, R never completes the operation (after a day, I killed the process). In particular, I did this:

    library(bigmemory)
    library(bigtabulate)
    X <- read.big.matrix("file.loc.X", sep=" ", type="double")
    hap.indices <- bigsplit(X, 1:2)  # this runs for too long to be useful on these matrices
    # I was then going to use a foreach loop to sum across the splits identified by bigsplit

SO - does anyone have ideas on how to deal with this problem, i.e., how to use a tapply()-like function on an enormous matrix? This isn't necessarily a bigtabulate question (although if I screwed up using bigsplit, let me know). If another package (e.g., an SQL package) can do something like this efficiently, I'd like to hear about it and your experiences using it.

Thank you in advance,

Matt

--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com
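
P.S. To make the goal concrete, here is a minimal sketch of the aggregation on a toy matrix small enough to fit in memory. The first four rows are the example rows above; the last two are made-up repeats, added only so that some pairs occur more than once. The names pair.key and pair.sums are purely illustrative. This is the operation I would like to reproduce at scale, not a solution to the size problem.

```r
# Toy stand-in for X: four example rows plus two invented repeats
# (illustrative values only, not real data).
X <- matrix(c( 1,  13, 58, 1.12,
               6, 142, 56, 1.11,
              18, 307, 64, 3.13,
              22, 320, 58, 0.72,
               1,  13, 60, 0.50,   # made-up repeat of pair (1, 13)
              18, 307, 70, 1.00),  # made-up repeat of pair (18, 307)
            ncol = 4, byrow = TRUE)

# Build one key per pair of IDs, then sum column 4 within each key.
pair.key  <- paste(X[, 1], X[, 2])
pair.sums <- tapply(X[, 4], pair.key, sum)

print(pair.sums)
```

The result has one entry per distinct pair, named by the pasted IDs, so the six toy rows collapse to four sums.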