Did you check out the 'colbycol' package?

On Fri, Mar 8, 2013 at 5:46 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:
> On 03/08/2013 06:01 AM, Jan van der Laan wrote:
>>
>> You could use the fact that scan reads the data rowwise, and the fact
>> that arrays are stored columnwise:
>>
>>     # generate a small example dataset
>>     exampl <- array(letters[1:25], dim=c(5,5))
>>     write.table(exampl, file="example.dat", row.names=FALSE,
>>         col.names=FALSE, sep="\t", quote=FALSE)
>>
>>     # and read...
>>     d <- scan("example.dat", what=character())
>>     d <- array(d, dim=c(5,5))
>>
>>     t(exampl) == d
>>
>> Although this is probably faster, it doesn't help with the large size.
>> You could use the n option of scan to read chunks/blocks and feed those
>> to, for example, an ff array (which you ideally have preallocated).
>
> I think it's worth asking what the overall goal is; all we get from this
> exercise is another large file that we can't easily manipulate in R!
>
> But nothing like a little challenge. The idea I think would be to
> transpose in chunks of rows by scanning in some number of rows and
> writing to a temporary file
>
>     tpose1 <- function(fin, nrowPerChunk, ncol) {
>         v <- scan(fin, character(), nmax=ncol * nrowPerChunk)
>         m <- matrix(v, ncol=ncol, byrow=TRUE)
>         fout <- tempfile()
>         write(m, fout, nrow(m), append=TRUE)
>         fout
>     }
>
> Apparently the data is 60k x 60k, so we could maybe easily read 60k x
> 10k at a time from some file fl <- "big.txt"
>
>     ncol <- 60000L
>     nrowPerChunk <- 10000L
>     nChunks <- ncol / nrowPerChunk
>
>     fin <- file(fl); open(fin)
>     fls <- replicate(nChunks, tpose1(fin, nrowPerChunk, ncol))
>     close(fin)
>
> 'fls' is now a vector of file paths, each containing a transposed slice
> of the matrix. The next task is to splice these together.
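[Editor's note: the first pass can be exercised end-to-end on a toy matrix. This is a self-contained sketch with made-up file names and sizes; tpose1() from the message above is repeated here, with quiet=TRUE added, so the snippet runs on its own.]

```r
tpose1 <- function(fin, nrowPerChunk, ncol) {
    v <- scan(fin, character(), nmax = ncol * nrowPerChunk, quiet = TRUE)
    m <- matrix(v, ncol = ncol, byrow = TRUE)
    fout <- tempfile()
    # writing m (column-major) with nrow(m) values per line emits t(m)
    write(m, fout, nrow(m), append = TRUE)
    fout
}

ncol <- 4L; nrowPerChunk <- 2L
fl <- tempfile()
writeLines(c("a b c d", "e f g h", "i j k l", "m n o p"), fl)

fin <- file(fl); open(fin)               # open once; scan() resumes per call
fls <- replicate(ncol / nrowPerChunk, tpose1(fin, nrowPerChunk, ncol))
close(fin)

readLines(fls[[1]])   # "a e" "b f" "c g" "d h": rows 1-2, transposed
```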
We could do this by
> taking a slice of rows from each file, cbind'ing them together, and
> writing to an output
>
>     splice <- function(fout, cons, nrowPerChunk, ncol) {
>         slices <- lapply(cons, function(con) {
>             v <- scan(con, character(), nmax=nrowPerChunk * ncol)
>             matrix(v, nrowPerChunk, byrow=TRUE)
>         })
>         m <- do.call(cbind, slices)
>         write(t(m), fout, ncol(m), append=TRUE)
>     }
>
> We'd need to use open connections as inputs and output
>
>     cons <- lapply(fls, file); for (con in cons) open(con)
>     fout <- file("big_transposed.txt"); open(fout, "w")
>     xx <- replicate(nChunks, splice(fout, cons, nrowPerChunk,
>                                     nrowPerChunk))
>     for (con in cons) close(con)
>     close(fout)
>
> As another approach, it looks like the data are from genotypes. If they
> really only consist of pairs of A, C, G, T, then two pairs, e.g., 'AA'
> 'CT', could be encoded as a single byte
>
>     alf <- c("A", "C", "G", "T")
>     nms <- outer(alf, alf, paste0)
>     map <- outer(setNames(as.raw(0:15), nms),
>                  setNames(as.raw(bitwShiftL(0:15, 4)), nms),
>                  "|")
>
> with e.g.,
>
>     > map[matrix(c("AA", "CT"), ncol=2)]
>     [1] d0
>
> This translates the problem of representing the 60k x 60k array as a
> 3.6 billion element vector of 60k * 60k * 8 bytes (approx. 30 Gbytes)
> to one of 60k x 30k = 1.8 billion elements (fits in R-2.15 vectors) of
> approx. 1.8 Gbyte (probably usable in an 8 Gbyte laptop).
>
> Personally, I would probably put this data in a netcdf / hdf5 file.
> Perhaps I'd use snpStats or GWASTools in Bioconductor
> http://bioconductor.org.
>
> Martin
>
>> HTH,
>>
>> Jan
>>
>> peter dalgaard <pda...@gmail.com> wrote:
>>
>>> On Mar 7, 2013, at 01:18 , Yao He wrote:
>>>
>>>> Dear all:
>>>>
>>>> I have a big data file of 60000 columns and 60000 rows like that:
>>>>
>>>> AA AC AA AA .......AT
>>>> CC CC CT CT.......TC
>>>> ..........................
>>>> .........................
>>>>
>>>> I want to transpose it, and the output is a new file like that:
>>>>
>>>> AA CC ............
>>>> AC CC............
>>>> AA CT.............
>>>> AA CT.........
>>>> ....................
>>>> ....................
>>>> AT TC.............
>>>>
>>>> The key point is I can't read it into R by read.table() because the
>>>> data is too large, so I tried:
>>>>
>>>>     c <- file("silygenotype.txt", "r")
>>>>     geno_t <- list()
>>>>     repeat{
>>>>         line <- readLines(c, n=1)
>>>>         if (length(line)==0) break  # end of file
>>>>         line <- unlist(strsplit(line, "\t"))
>>>>         geno_t <- cbind(geno_t, line)
>>>>     }
>>>>     write.table(geno_t, "xxx.txt")
>>>>
>>>> It works but it is too slow; how can I optimize it?
>>>
>>> As others have pointed out, that's a lot of data!
>>>
>>> You seem to have the right idea: If you read the columns line by line,
>>> there is nothing to transpose. A couple of points, though:
>>>
>>> - The cbind() is a potential performance hit since it copies the list
>>>   every time around. Use geno_t <- vector("list", 60000) and then
>>>   geno_t[[i]] <- <etc>
>>>
>>> - You might use scan() instead of readLines, strsplit
>>>
>>> - Perhaps consider the data type, as you seem to be reading strings
>>>   with 16 possible values (I suspect that R already optimizes string
>>>   storage to make this point moot, though.)
>>>
>>> --
>>> Peter Dalgaard, Professor
>>> Center for Statistics, Copenhagen Business School
>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>> Phone: (+45)38153501
>>> Email: pd....@cbs.dk  Priv: pda...@gmail.com
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
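[Editor's note: Peter's two main suggestions, a preallocated list and scan() in place of readLines()/strsplit(), can be combined into a minimal sketch. File name, contents, and row count are placeholders; a small space-separated example file is written first so the snippet runs on its own.]

```r
fl <- tempfile()
writeLines(c("AA AC AA", "CC CC CT", "AA CT GG", "AT TC TT"), fl)
nr <- 4L                               # number of rows in the input file

con <- file(fl, "r")
geno_t <- vector("list", nr)           # preallocate: no copy on each append
for (i in seq_len(nr))
    geno_t[[i]] <- scan(con, character(), nlines = 1, quiet = TRUE)
close(con)

# input row i becomes column i, so m is the transpose of the file
m <- do.call(cbind, geno_t)
m[1, ]   # "AA" "CC" "AA" "AT" -- the first column of the input
```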
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
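[Editor's note: Martin's one-byte-per-pair encoding can be verified in a short, self-contained round trip. The decoding step below is an editor's addition, not part of the original message; it assumes the low nibble holds the first genotype and the high nibble the second, matching how map is built.]

```r
alf <- c("A", "C", "G", "T")
nms <- outer(alf, alf, paste0)                  # the 16 genotype strings
map <- outer(setNames(as.raw(0:15), nms),
             setNames(as.raw(bitwShiftL(0:15, 4)), nms),
             "|")                               # raw `|` is bitwise OR

packed <- map["AA", "CT"]                       # one byte for two genotypes

# recover the pair: low nibble -> first genotype, high nibble -> second
lo <- as.integer(packed & as.raw(0x0f)) + 1L
hi <- bitwShiftR(as.integer(packed), 4L) + 1L
c(nms[lo], nms[hi])                             # "AA" "CT"
```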