On Mar 8, 2013, at 9:31 AM, David Winsemius wrote:

>
> On Mar 8, 2013, at 6:01 AM, Jan van der Laan wrote:
>
>>
>> You could use the fact that scan reads the data rowwise, and the fact that
>> arrays are stored columnwise:
>>
>> # generate a small example dataset
>> exampl <- array(letters[1:25], dim=c(5,5))
>> write.table(exampl, file="example.dat", row.names=FALSE, col.names=FALSE,
>>     sep="\t", quote=FALSE)
>>
>
> This might avoid creation of some of the intermediate copies:
>
> MASS::write.matrix( matrix( scan("example.dat", what=character()), 5,5),
>     file="fil.out")
>
> I tested it up to a 5000 x 5000 file:
>
>> exampl <- array(letters[1:25], dim=c(5000,5000))
>> MASS::write.matrix( matrix( scan("example.dat", what=character()),
>>     5000,5000), file="fil.out")
> Read 25000000 items
>>
>
> Not sure of the exact timing. Probably 5-10 minutes. The exampl-object takes
> 200,001,400 bytes and did not noticeably stress my machine. Most of my RAM
> remains untouched. I'm going out on errands and will run timing on a 10K x
> 10K test case within a system.time() enclosure. Scan did report successfully
> reading 100000000 items fairly promptly.
>

> system.time( {MASS::write.matrix( matrix( scan("example.dat", what=character()), 10000,10000), file="fil.out") } )
Read 100000000 items
    user   system  elapsed
 487.100  912.613 1415.228

> system.time( {MASS::write.matrix( matrix( scan("example.dat", what=character()), 500,500), file="fil.out") } )
Read 250000 items
   user  system elapsed
  1.184   2.507   3.834

And so it seems to scale linearly:

> 3.834 * 100000000/250000
[1] 1533.6

> --
> David.
>
>> # and read...
>> d <- scan("example.dat", what=character())
>> d <- array(d, dim=c(5,5))
>>
>> t(exampl) == d
>>
>>
>> Although this is probably faster, it doesn't help with the large size. You
>> could use the n option of scan to read chunks/blocks and feed those to, for
>> example, an ff array (which you ideally have preallocated).
>>
>> HTH,
>>
>> Jan
>>
>>
>> peter dalgaard <pda...@gmail.com> wrote:
>>
>>> On Mar 7, 2013, at 01:18 , Yao He wrote:
>>>
>>>> Dear all:
>>>>
>>>> I have a big data file of 60000 columns and 60000 rows like that:
>>>>
>>>> AA AC AA AA .......AT
>>>> CC CC CT CT.......TC
>>>> ..........................
>>>> .........................
>>>>
>>>> I want to transpose it and the output is a new like that
>>>> AA CC ............
>>>> AC CC............
>>>> AA CT.............
>>>> AA CT.........
>>>> ....................
>>>> ....................
>>>> AT TC.............
>>>>
>>>> The keypoint is I can't read it into R by read.table() because the
>>>> data is too large, so I try that:
>>>> c<-file("silygenotype.txt","r")
>>>> geno_t<-list()
>>>> repeat{
>>>> line<-readLines(c,n=1)
>>>> if (length(line)==0)break #end of file
>>>> line<-unlist(strsplit(line,"\t"))
>>>> geno_t<-cbind(geno_t,line)
>>>> }
>>>> write.table(geno_t,"xxx.txt")
>>>>
>>>> It works but it is too slow, how to optimize it???
>>>
>>>
>>> As others have pointed out, that's a lot of data!
>>>
>>> You seem to have the right idea: If you read the columns line by line there
>>> is nothing to transpose.
>>> A couple of points, though:
>>>
>>> - The cbind() is a potential performance hit since it copies the list every
>>> time around. geno_t <- vector("list", 60000) and then
>>> geno_t[[i]] <- <etc>
>>>
>>> - You might use scan() instead of readLines, strsplit
>>>
>>> - Perhaps consider the data type as you seem to be reading strings with 16
>>> possible values (I suspect that R already optimizes string storage to make
>>> this point moot, though.)
>>>
>>> --
>>> Peter Dalgaard, Professor
>>> Center for Statistics, Copenhagen Business School
>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>> Phone: (+45)38153501
>>> Email: pd....@cbs.dk Priv: pda...@gmail.com
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius
> Alameda, CA, USA

David Winsemius
Alameda, CA, USA
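
Two suggestions from the thread were never spelled out in code: Jan's idea of using the n argument of scan() to read blocks, and Peter's point about preallocating instead of growing the result with cbind(). A minimal sketch combining them might look like the following. The function name transpose_file and its arguments nrows, ncols and block_rows are hypothetical and not from the thread. It assumes a tab-delimited file whose dimensions are known in advance, and it holds the result in an ordinary in-memory character matrix. For the full 60000 x 60000 case that matrix would probably have to be replaced by something disk-backed, such as the ff array Jan mentions, which is not shown here.

```r
# Hypothetical helper (not from the thread): transpose a tab-delimited file
# of known dimensions by reading a block of rows at a time.
transpose_file <- function(infile, outfile, nrows, ncols, block_rows = 1000) {
  # Preallocate the transposed result once instead of growing it with cbind().
  # Assumption: the result fits in memory as a plain character matrix.
  out <- matrix(NA_character_, nrow = ncols, ncol = nrows)

  con <- file(infile, open = "r")
  on.exit(close(con))

  done <- 0
  while (done < nrows) {
    k <- min(block_rows, nrows - done)
    # scan() on an open connection continues from the current position and
    # returns the next k rows as one vector of k * ncols fields, row by row.
    block <- scan(con, what = character(), n = k * ncols,
                  sep = "\t", quiet = TRUE)
    if (length(block) != k * ncols) {
      stop("file has fewer fields than the declared dimensions")
    }
    # Matrices are filled column-wise, so each input row lands in one
    # output column: this is the transpose, with no call to t().
    out[, done + seq_len(k)] <- block
    done <- done + k
  }

  write.table(out, file = outfile, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  invisible(out)
}
```

On a small file written as in Jan's example, a call such as transpose_file("example.dat", "fil.out", nrows = 5, ncols = 5, block_rows = 2) should return the same matrix as t(exampl). The block size only controls how many rows each scan() call reads, so it can be tuned to the memory available for the intermediate vector.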