On Wed, Mar 6, 2013 at 4:18 PM, Yao He <yao.h.1...@gmail.com> wrote:
> Dear all:
>
> I have a big data file of 60000 columns and 60000 rows like that:
>
> AA AC AA AA .......AT
> CC CC CT CT.......TC
> ..........................
> .........................
>
> I want to transpose it, and the output is a new file like that:
>
> AA CC ............
> AC CC............
> AA CT.............
> AA CT.........
> ....................
> ....................
> AT TC.............
>
> The key point is I can't read it into R by read.table() because the
> data is too large, so I try that:
>
> c<-file("silygenotype.txt","r")
> geno_t<-list()
> repeat{
>   line<-readLines(c,n=1)
>   if (length(line)==0) break  #end of file
>   line<-unlist(strsplit(line,"\t"))
>   geno_t<-cbind(geno_t,line)
> }
> write.table(geno_t,"xxx.txt")
>
> It works but it is too slow, how to optimize it???
I hate to be negative, but this will also not work on a 60000 x 60000
matrix. At some point R will complain either about the lack of memory or
about you trying to allocate a vector that is too long. I think your best
bet is to look at file-backed data packages (for example, package
bigmemory). Look at this URL:

http://cran.r-project.org/web/views/HighPerformanceComputing.html

and scroll down to "Large memory and out-of-memory data". Some of the
packages there may have the functionality you are looking for and may do
it faster than your code. (A rough sketch of the bigmemory idea is at the
end of this message.)

If this doesn't help, you _may_ be able to make your code work, albeit
slowly, if you replace the cbind() with data.frame(). cbind() will in this
case produce a matrix, and matrices are limited to 2^31 elements, which is
less than 60000 * 60000. A data frame is a special type of list and so
_may_ be able to handle that many elements, given enough system RAM. There
are experts on this list who will correct me if I'm wrong.

If you are on a Linux system, you can use split (type "man split" at the
shell prompt to see help) to split the file into smaller chunks of, say,
5000 lines each. Process each chunk separately, write it into a separate
output file, and then use the Linux utility paste to "paste" the files
side by side into the final output. (A sketch of this approach, done
entirely in R, also follows below.)

Further, if you want to make it faster, do not grow geno_t by cbind'ing a
new column onto it in each iteration. Pre-allocate a matrix or data frame
with the appropriate number of rows and columns and fill it in as you go.
It will still be slow, though, which I think is due to the inherent
slowness of readLines() and possibly strsplit().

HTH,
Peter
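To make the bigmemory idea concrete, here is a rough, untested sketch. A
big.matrix stores numbers, not strings, so the genotypes have to be
recoded as integers through a lookup vector; the genotype alphabet and all
file names below are my own inventions for illustration, not anything from
the original post:

  library(bigmemory)

  ## assumed genotype alphabet; adjust to whatever codes the file uses
  codes <- c("AA","AC","AG","AT","CC","CG","CT","GG","GT","TT")

  ## file-backed 60000 x 60000 matrix living on disk, not in RAM;
  ## the backing/descriptor file names are made up for this sketch
  x <- filebacked.big.matrix(60000, 60000, type = "short",
                             backingfile = "geno.bin",
                             descriptorfile = "geno.desc")

  con <- file("silygenotype.txt", "r")
  i <- 1
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break        # end of file
    ## row i of the input becomes column i of x: the transpose
    x[, i] <- match(strsplit(line, "\t")[[1]], codes)
    i <- i + 1
  }
  close(con)

When you read pieces of x back, codes[x[i, j]] turns the integers into
genotype strings again.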
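And here is a sketch of the chunk-and-paste idea done entirely in R, which
also avoids growing an object inside the loop: do.call(cbind, ...) on the
strsplit() result binds each input row as a column, so every chunk is
transposed in one step and written out. The chunk size of 1000 is a guess;
tune it to your RAM. Again untested:

  con <- file("silygenotype.txt", "r")
  chunk <- 1
  repeat {
    lines <- readLines(con, n = 1000)   # up to 1000 rows at a time
    if (length(lines) == 0) break       # end of file
    ## each input row becomes a column: this chunk is the transpose
    m <- do.call(cbind, strsplit(lines, "\t"))
    write.table(m, sprintf("chunk%04d.txt", chunk), quote = FALSE,
                sep = "\t", row.names = FALSE, col.names = FALSE)
    chunk <- chunk + 1
  }
  close(con)

  ## afterwards, at the shell, glue the chunks side by side:
  ##   paste -d'\t' chunk*.txt > transposed.txt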