Thanks for everybody's help! I learned a lot from this discussion!
2013/3/10 jim holtman <jholt...@gmail.com>:
> Did you check out the 'colbycol' package?
>
> On Fri, Mar 8, 2013 at 5:46 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:
>
>> On 03/08/2013 06:01 AM, Jan van der Laan wrote:
>>
>>> You could use the fact that scan reads the data row-wise, and the fact
>>> that arrays are stored column-wise:
>>>
>>> # generate a small example dataset
>>> exampl <- array(letters[1:25], dim=c(5,5))
>>> write.table(exampl, file="example.dat", row.names=FALSE, col.names=FALSE,
>>>     sep="\t", quote=FALSE)
>>>
>>> # and read...
>>> d <- scan("example.dat", what=character())
>>> d <- array(d, dim=c(5,5))
>>>
>>> t(exampl) == d
>>>
>>> Although this is probably faster, it doesn't help with the large size.
>>> You could use the n option of scan to read chunks/blocks and feed those
>>> to, for example, an ff array (which you ideally have preallocated).
>>
>> I think it's worth asking what the overall goal is; all we get from this
>> exercise is another large file that we can't easily manipulate in R!
>>
>> But nothing like a little challenge. The idea, I think, would be to
>> transpose in chunks of rows, scanning in some number of rows and writing
>> the transposed chunk to a temporary file:
>>
>> tpose1 <- function(fin, nrowPerChunk, ncol) {
>>     v <- scan(fin, character(), nmax=ncol * nrowPerChunk)
>>     m <- matrix(v, ncol=ncol, byrow=TRUE)
>>     fout <- tempfile()
>>     write(m, fout, nrow(m), append=TRUE)
>>     fout
>> }
>>
>> Apparently the data are 60k x 60k, so we could read 10k rows of 60k
>> columns at a time from some file fl <- "big.txt":
>>
>> ncol <- 60000L
>> nrowPerChunk <- 10000L
>> nChunks <- ncol / nrowPerChunk
>>
>> fin <- file(fl); open(fin)
>> fls <- replicate(nChunks, tpose1(fin, nrowPerChunk, ncol))
>> close(fin)
>>
>> 'fls' is now a vector of file paths, each containing a transposed slice
>> of the matrix. The next task is to splice these together. We could do
>> this by taking a slice of rows from each file, cbind'ing them together,
>> and writing to an output:
>>
>> splice <- function(fout, cons, nrowPerChunk, ncol) {
>>     slices <- lapply(cons, function(con) {
>>         v <- scan(con, character(), nmax=nrowPerChunk * ncol)
>>         matrix(v, nrowPerChunk, byrow=TRUE)
>>     })
>>     m <- do.call(cbind, slices)
>>     write(t(m), fout, ncol(m), append=TRUE)
>> }
>>
>> We'd need to use open connections as inputs and output:
>>
>> cons <- lapply(fls, file); for (con in cons) open(con)
>> fout <- file("big_transposed.txt"); open(fout, "w")
>> xx <- replicate(nChunks, splice(fout, cons, nrowPerChunk, nrowPerChunk))
>> for (con in cons) close(con)
>> close(fout)
>>
>> As another approach, it looks like the data are genotypes. If they really
>> consist only of pairs of A, C, G, T, then two pairs, e.g., 'AA' and 'CT',
>> can be encoded as a single byte:
>>
>> alf <- c("A", "C", "G", "T")
>> nms <- outer(alf, alf, paste0)
>> map <- outer(setNames(as.raw(0:15), nms),
>>              setNames(as.raw(bitwShiftL(0:15, 4)), nms),
>>              "|")
>>
>> with, e.g.,
>>
>> > map[matrix(c("AA", "CT"), ncol=2)]
>> [1] d0
>>
>> This turns the problem of representing the 60k x 60k array from a
>> 3.6 billion element character vector of 60k * 60k * 8 bytes (approx.
>> 30 GB) into a raw vector of 60k x 30k = 1.8 billion elements (within
>> R-2.15 vector limits) of approx. 1.8 GB, probably usable on an 8 GB
>> laptop.
>>
>> Personally, I would probably put this data in a netCDF / HDF5 file.
>> Perhaps I'd use snpStats or GWASTools from Bioconductor,
>> http://bioconductor.org.
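>>
>> To get back from a packed byte to the two genotype pairs, a minimal
>> round-trip sketch (it assumes 'map' and 'nms' as defined above, with
>> the low nibble holding the first pair):
>>
>> packed <- map["AA", "CT"]                     # raw d0
>> nms[bitwAnd(as.integer(packed), 15L) + 1L]    # low nibble  -> "AA"
>> nms[bitwShiftR(as.integer(packed), 4L) + 1L]  # high nibble -> "CT"
>>
>> The byte codes index 'nms' column-wise, so both pairs are recoverable
>> from each packed byte without storing the strings themselves.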
>>
>> Martin
>>
>>> HTH,
>>>
>>> Jan
>>>
>>> peter dalgaard <pda...@gmail.com> wrote:
>>>
>>>> On Mar 7, 2013, at 01:18 , Yao He wrote:
>>>>
>>>>> Dear all:
>>>>>
>>>>> I have a big data file of 60000 columns and 60000 rows like that:
>>>>>
>>>>> AA AC AA AA .......AT
>>>>> CC CC CT CT.......TC
>>>>> ..........................
>>>>> .........................
>>>>>
>>>>> I want to transpose it, and the output is a new file like that:
>>>>>
>>>>> AA CC ............
>>>>> AC CC............
>>>>> AA CT.............
>>>>> AA CT.........
>>>>> ....................
>>>>> ....................
>>>>> AT TC.............
>>>>>
>>>>> The key point is that I can't read it into R by read.table() because
>>>>> the data is too large, so I tried this:
>>>>>
>>>>> c <- file("silygenotype.txt", "r")
>>>>> geno_t <- list()
>>>>> repeat {
>>>>>     line <- readLines(c, n=1)
>>>>>     if (length(line) == 0) break  # end of file
>>>>>     line <- unlist(strsplit(line, "\t"))
>>>>>     geno_t <- cbind(geno_t, line)
>>>>> }
>>>>> write.table(geno_t, "xxx.txt")
>>>>>
>>>>> It works but it is too slow; how can I optimize it?
>>>>
>>>> As others have pointed out, that's a lot of data!
>>>>
>>>> You seem to have the right idea: if you read the columns line by line,
>>>> there is nothing to transpose. A couple of points, though:
>>>>
>>>> - The cbind() is a potential performance hit since it copies the list
>>>>   every time around. Use geno_t <- vector("list", 60000) and then
>>>>   geno_t[[i]] <- <etc>
>>>>
>>>> - You might use scan() instead of readLines()/strsplit().
>>>>
>>>> - Perhaps consider the data type, as you seem to be reading strings
>>>>   with 16 possible values. (I suspect that R already optimizes string
>>>>   storage to make this point moot, though.)
>>>>
>>>> --
>>>> Peter Dalgaard, Professor
>>>> Center for Statistics, Copenhagen Business School
>>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>>> Phone: (+45)38153501
>>>> Email: pd....@cbs.dk  Priv: pda...@gmail.com
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
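Putting Peter's points together, the original loop becomes roughly the
following (a minimal sketch, untested at the full 60000 x 60000 scale;
it assumes a tab-separated file with 60000 fields per line):

con <- file("silygenotype.txt", "r")
geno_t <- vector("list", 60000)   # preallocate: no copy on each append
i <- 0L
repeat {
    line <- scan(con, what = character(), nlines = 1, quiet = TRUE)
    if (length(line) == 0) break  # end of file
    i <- i + 1L
    geno_t[[i]] <- line           # input row i = output column i
}
close(con)
# bind the stored rows as columns, i.e. the transpose, and write it out
m <- do.call(cbind, geno_t[seq_len(i)])
write.table(m, "xxx.txt", quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = FALSE)

Note that this still holds the whole character matrix in memory, so at
60000 x 60000 Martin's chunked transpose or byte packing is the realistic
route; the sketch only removes the cbind() copying and the
readLines()/strsplit() overhead.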
--
—————————————————————————
Master's candidate, 2nd year
Department of Animal Genetics & Breeding
Room 436, College of Animal Science & Technology
China Agricultural University, Beijing, 100193
E-mail: yao.h.1...@gmail.com
——————————————————————————

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.