On Wed, Mar 6, 2013 at 7:56 PM, Peter Langfelder
<peter.langfel...@gmail.com> wrote:
> On Wed, Mar 6, 2013 at 4:18 PM, Yao He <yao.h.1...@gmail.com> wrote:
>> Dear all:
>>
>> I have a big data file of 60000 columns and 60000 rows, like this:
>>
>> AA AC AA AA .......AT
>> CC CC CT CT.......TC
>> ..........................
>> .........................
>>
>> I want to transpose it, and the output is a new file like this:
>>
>> AA CC ............
>> AC CC............
>> AA CT.............
>> AA CT.........
>> ....................
>> ....................
>> AT TC.............
>>
>> The key point is that I can't read it into R with read.table() because
>> the data are too large, so I tried this:
>>
>> con <- file("silygenotype.txt", "r")
>> geno_t <- list()
>> repeat {
>>   line <- readLines(con, n = 1)
>>   if (length(line) == 0) break  # end of file
>>   line <- unlist(strsplit(line, "\t"))
>>   geno_t <- cbind(geno_t, line)
>> }
>> write.table(geno_t, "xxx.txt")
>>
>> It works, but it is too slow. How can I optimize it?
>
> I hate to be negative, but this will also not work on a 60000 x 60000
> matrix. At some point R will complain either about the lack of memory
> or about you trying to allocate a vector that is too long.
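
Just to put rough numbers on that (plain arithmetic, and assuming the
whole 60000 x 60000 table had to be held as a single character matrix
in an ordinary R session):

60000 * 60000   # 3.6e9 cells in the transposed table
2^31 - 1        # ~2.1e9, the element limit for a vector/matrix in released R (2.15.x)
# and each cell is a 2-character string, so on a 64-bit machine the
# pointers alone would take roughly 3.6e9 * 8 bytes, i.e. 25-30 GB of RAM
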
Maybe this depends on the R version. I have not tried it, but the
development version of R can handle much larger vectors. See
http://stat.ethz.ch/R-manual/R-devel/library/base/html/LongVectors.html

Yao He, if you are feeling adventurous, you could give the development
version of R a try.

Best,
Ista

> I think your best bet is to look at file-backed data packages (for
> example, package bigmemory). Look at this URL:
> http://cran.r-project.org/web/views/HighPerformanceComputing.html and
> scroll down to "Large memory and out-of-memory data". Some of the
> packages may have the functionality you are looking for and may do it
> faster than your code.
>
> If this doesn't help, you _may_ be able to make your code work, albeit
> slowly, if you replace the cbind() with a data.frame. cbind() will in
> this case produce a matrix, and matrices are limited to 2^31 elements,
> which is less than 60000 times 60000. A data.frame is a special type
> of list and so _may_ be able to handle that many elements, given
> enough system RAM. There are experts on this list who will correct me
> if I'm wrong.
>
> If you are on a Linux system, you can use split (type man split at the
> shell prompt to see help) to split the file into smaller chunks of,
> say, 5000 lines or so. Process each file separately, write it into a
> separate output file, then use the Linux utility paste to "paste" the
> files side by side into the final output.
>
> Further, if you want to make it faster, do not grow geno_t by
> cbind'ing a new column to it in each iteration. Pre-allocate a matrix
> or data frame with an appropriate number of rows and columns and fill
> it in as you go. But it will still be slow, which I think is due to
> the inherent slowness of readLines and possibly strsplit.
>
> HTH,
>
> Peter
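
For what it's worth, here is a rough, untested sketch of the chunked
approach Peter describes above: read a block of lines, transpose it,
write each transposed block to its own file, and paste the pieces
together at the shell. It assumes the file really is tab-separated and
that every line has the same number of fields; the chunk file names are
just placeholders, and chunk_size should be whatever comfortably fits
in memory.

chunk_size <- 5000                       # lines per chunk (Peter's suggestion); reduce if memory is tight
con <- file("silygenotype.txt", "r")
chunk <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break          # end of file
  # one row per input line; build the whole chunk at once instead of
  # growing an object with cbind() inside the loop
  m <- do.call(rbind, strsplit(lines, "\t"))
  chunk <- chunk + 1
  # t(m) turns this chunk's rows into columns of the transposed result;
  # "chunk_XXX.txt" is just an example name
  write.table(t(m), sprintf("chunk_%03d.txt", chunk),
              quote = FALSE, sep = "\t",
              row.names = FALSE, col.names = FALSE)
}
close(con)

# then, at the shell prompt (paste's default delimiter is a tab;
# "transposed.txt" is again just an example name):
#   paste chunk_*.txt > transposed.txt

Each block of input rows becomes a block of columns in the transposed
output, so pasting the chunk files side by side in order reproduces the
full transposed table without ever holding more than one chunk in
memory.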